Unverified Commit 3323146e authored by Sylvain Gugger, committed by GitHub

Models doc (#7345)



* Clean up model documentation

* Formatting

* Preparation work

* Long lines

* Main work on rst files

* Cleanup all config files

* Syntax fix

* Clean all tokenizers

* Work on first models

* Models beginning

* FlauBERT

* All PyTorch models

* All models

* Long lines again

* Fixes

* More fixes

* Update docs/source/model_doc/bert.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update docs/source/model_doc/electra.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Last fixes
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
parent 58405a52
Tokenizer summary
-----------------------------------------------------------------------------------------------------------------------

On this page, we will have a closer look at tokenization. As we saw in
:doc:`the preprocessing tutorial <preprocessing>`, tokenizing a text is splitting it into words or subwords, which are
then converted to ids. The second part is pretty straightforward; here we will focus on the first part. More
specifically, we will look at the three main kinds of tokenizers used in 🤗 Transformers:
:ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>` and
:ref:`SentencePiece <sentencepiece>`, and provide examples of models using each of them.

Note that on each model page, you can look at the documentation of the associated tokenizer to know which of those
algorithms the pretrained model used. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
it's using :ref:`WordPiece <wordpiece>`.

Introduction to tokenization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing it.
For instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do." A first simple way of
tokenizing this text is to split it on spaces, which would give:

.. code-block::

    ["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]

This is a nice first step, but if we look at the tokens "Transformers?" or "do.", we can see we can do better. Those
will be different from the tokens "Transformers" and "do" for our model, so we should probably take the punctuation
into account. This would give:

.. code-block::

    ["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]

which is better already. One thing that is annoying though is how it dealt with "Don't". "Don't" stands for "do not",
so it would probably be better tokenized as ``["Do", "n't"]``. This is where things start getting more complicated,
and part of the reason why each kind of model has its own tokenizer class. Depending on the rules we apply to split
our texts into tokens, we'll get different tokenized versions of the same text. And of course, a given pretrained
model won't perform properly if you don't use the exact same rules as the people who pretrained it.

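To make the two naive splitting strategies above concrete, here is a plain-Python sketch (purely illustrative; it is
not what the tokenizers in the library do internally):

.. code-block::

    >>> import re

    >>> text = "Don't you love 🤗 Transformers? We sure do."
    >>> # Split on whitespace only.
    >>> text.split()
    ["Don't", 'you', 'love', '🤗', 'Transformers?', 'We', 'sure', 'do.']
    >>> # Split on whitespace and treat every punctuation sign as its own token.
    >>> re.findall(r"\w+|[^\w\s]", text)
    ['Don', "'", 't', 'you', 'love', '🤗', 'Transformers', '?', 'We', 'sure', 'do', '.']
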
`spaCy <https://spacy.io/>`__ and `Moses <http://www.statmt.org/moses/?n=Development.GetStarted>`__ are two popular
rule-based tokenizers. On the text above, they'd output something like:

.. code-block::

    ["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]

Space/punctuation-tokenization and rule-based tokenization are both examples of word tokenization, which is splitting
a sentence into words. While it's the most intuitive way to separate texts into smaller chunks, it can become a
problem when you have a huge corpus: it usually yields a very big vocabulary (the set of all unique tokens used).
:doc:`Transformer-XL <model_doc/transformerxl>`, for instance, uses space/punctuation-tokenization and has a
vocabulary size of 267,735!

A huge vocabulary size means a huge embedding matrix at the start of the model, which will cause memory problems.
Transformer-XL deals with this by using a special kind of embeddings called adaptive embeddings, but in general,
transformers models rarely have a vocabulary size greater than 50,000, especially if they are trained on a single
language.

So if tokenizing on words is unsatisfactory, we could go in the opposite direction and simply tokenize on characters.
While it's very simple and would save a lot of memory, this doesn't allow the model to learn representations of texts
as meaningful as when using a word tokenization, leading to a loss of performance. So to get the best of both worlds,
all transformers models use a hybrid between word-level and character-level tokenization called subword tokenization.

Subword tokenization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Subword tokenization algorithms rely on the principle that the most common words should be left as is, but rare words
should be decomposed into meaningful subword units. For instance "annoyingly" might be considered a rare word and
decomposed into "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you
can form (almost) arbitrarily long complex words by stringing together subwords.

This allows the model to keep a reasonable vocabulary size while still learning useful representations for common
words or subwords. It also enables the model to process words it has never seen before, by decomposing them into
subwords it knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like
this:

.. code-block::

    >>> from transformers import BertTokenizer
    >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    >>> tokenizer.tokenize("I have a new GPU!")
    ['i', 'have', 'a', 'new', 'gp', '##u', '!']

Since we are considering the uncased model, the sentence was lowercased first. Then all the words were present in the
vocabulary of the tokenizer except "gpu", so the tokenizer split it into subwords it knows: "gp" and "##u". The "##"
means that the rest of the token should be attached to the previous one, without space (for when we need to decode
predictions and reverse the tokenization).

Another example is when we use the base :class:`~transformers.XLNetTokenizer` to tokenize our previous text:

.. code-block::

    >>> from transformers import XLNetTokenizer
    >>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
    >>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
    ['▁Don', "'", 't', '▁you', '▁love', '▁', '🤗', '▁', 'Transform', 'ers', '?', '▁We', '▁sure', '▁do', '.']

We'll get back to the meaning of those '▁' characters when we look at :ref:`SentencePiece <sentencepiece>`, but you
can see that "Transformers" has been split into "Transform" and "ers".

Let's now look at how the different subword tokenization algorithms work. Note that they all rely on some form of
training, which is usually done on the corpus the corresponding model will be trained on.

.. _byte-pair-encoding:

Byte-Pair Encoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Byte-Pair Encoding was introduced in `this paper <https://arxiv.org/abs/1508.07909>`__. It relies on a pretokenizer
that splits the training data into words; this can be a simple space tokenization (:doc:`GPT-2 <model_doc/gpt2>` and
:doc:`RoBERTa <model_doc/roberta>` use this, for instance) or a rule-based tokenizer (:doc:`XLM <model_doc/xlm>` uses
Moses for most languages, as does :doc:`FlauBERT <model_doc/flaubert>`, while :doc:`GPT <model_doc/gpt>` uses spaCy
and ftfy). After the pre-tokenization, the frequency of each word in the training corpus is counted.

The algorithm then starts from the list of all characters present in those words and learns merge rules that combine
two symbols of the vocabulary into a new token, until it has built a vocabulary of the desired size (the vocabulary
size is a hyperparameter to pick).

Let's say that after the pre-tokenization we have the following words (the number indicating the frequency of each
word):

.. code-block::

    ('hug', 10), ('pug', 5), ('pun', 12), ('bun', 4), ('hugs', 5)

Then the base vocabulary is ``['b', 'g', 'h', 'n', 'p', 's', 'u']`` and all our words are first split by character:

.. code-block::

    ('h' 'u' 'g', 10), ('p' 'u' 'g', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'u' 'g' 's', 5)

We then count the occurrences of each pair of consecutive symbols and look for the most frequent one. For instance,
'hu' is present `10 + 5 = 15` times (10 times in the 10 occurrences of 'hug', 5 times in the 5 occurrences of
'hugs'). The most frequent pair here is 'ug', present `10 + 5 + 5 = 20` times in total. So the first merge rule the
tokenizer learns is to group 'u' and 'g' together, and it adds 'ug' to the vocabulary. Our corpus then becomes

.. code-block::

    ('h' 'ug', 10), ('p' 'ug', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'ug' 's', 5)

and we continue by looking at the next most common pair of symbols. It's 'un', present 16 times, so we merge those
two and add 'un' to the vocabulary. Then it's 'hug' (as 'h' + 'ug'), present 15 times, so we merge those two and add
'hug' to the vocabulary.

At this stage, the vocabulary is ``['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']`` and our corpus is
represented as

.. code-block::

    ('hug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('hug' 's', 5)

If we stop there, the tokenizer can apply the rules it has learned to new words, as long as they don't contain
characters that were not in the base vocabulary. For instance, 'bug' would be tokenized as ``['b', 'ug']``, but 'mug'
would be tokenized as ``['<unk>', 'ug']`` since 'm' is not in the base vocabulary. In general this doesn't happen to
individual letters (the training corpus usually contains all of them), but it does happen to special characters like
emojis.

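To make the procedure concrete, here is a minimal sketch of the merge-learning loop on our toy corpus. It is written
for this summary only and is not the library's actual (much more efficient and complete) implementation:

.. code-block::

    >>> from collections import Counter

    >>> # Word frequencies after pre-tokenization, each word split into characters.
    >>> corpus = {('h', 'u', 'g'): 10, ('p', 'u', 'g'): 5, ('p', 'u', 'n'): 12,
    ...           ('b', 'u', 'n'): 4, ('h', 'u', 'g', 's'): 5}
    >>> vocab = {c for word in corpus for c in word}

    >>> def learn_merges(corpus, vocab, num_merges):
    ...     merges = []
    ...     for _ in range(num_merges):
    ...         # Count every pair of consecutive symbols, weighted by word frequency.
    ...         pairs = Counter()
    ...         for word, freq in corpus.items():
    ...             for pair in zip(word, word[1:]):
    ...                 pairs[pair] += freq
    ...         best = max(pairs, key=pairs.get)
    ...         merges.append(best)
    ...         vocab.add(best[0] + best[1])
    ...         # Apply the new merge rule to every word of the corpus.
    ...         new_corpus = {}
    ...         for word, freq in corpus.items():
    ...             symbols, i = [], 0
    ...             while i < len(word):
    ...                 if i < len(word) - 1 and (word[i], word[i + 1]) == best:
    ...                     symbols.append(word[i] + word[i + 1])
    ...                     i += 2
    ...                 else:
    ...                     symbols.append(word[i])
    ...                     i += 1
    ...             new_corpus[tuple(symbols)] = freq
    ...         corpus = new_corpus
    ...     return merges, vocab

    >>> merges, vocab = learn_merges(corpus, vocab, num_merges=3)
    >>> merges
    [('u', 'g'), ('u', 'n'), ('h', 'ug')]

The three learned merges are exactly the ones described above: 'ug', then 'un', then 'hug'.
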
As we said before, the vocabulary size (which is the base vocabulary size + the number of merges) is a hyperparameter
to choose. For instance :doc:`GPT <model_doc/gpt>` has a vocabulary size of 40,478 since they have 478 base
characters and chose to stop the training of the tokenizer at 40,000 merges.

Byte-level BPE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To deal with the fact that the base vocabulary needs to contain all base characters, which can get quite big if one
allows for all Unicode characters, the
`GPT-2 paper <https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__
introduces a clever trick, which is to use bytes as the base vocabulary (giving a base size of 256). With some
additional rules to deal with punctuation, this makes it possible to tokenize every text without ever needing an
unknown token. For instance, the :doc:`GPT-2 model <model_doc/gpt2>` has a vocabulary size of 50,257, which
corresponds to the 256 byte base tokens, a special end-of-text token and the symbols learned with 50,000 merges.

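As a quick sanity check (a small illustration using the byte-level BPE tokenizer shipped with the library), the
pretrained GPT-2 tokenizer reports exactly that vocabulary size:

.. code-block::

    >>> from transformers import GPT2Tokenizer
    >>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    >>> len(tokenizer)  # 256 byte tokens + 50,000 merges + 1 end-of-text token
    50257
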
.. _wordpiece:

WordPiece
=======================================================================================================================

WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>` (as well as
:doc:`DistilBERT <model_doc/distilbert>` and :doc:`Electra <model_doc/electra>`) and was outlined in
`this paper <https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__. It relies
on the same base as BPE, which is to initialize the vocabulary to every character present in the corpus and
progressively learn a given number of merge rules. The difference is that it doesn't choose the most frequent pair,
but the one that will maximize the likelihood of the corpus once merged.

What does this mean? Well, in the previous example, it means we would only merge 'u' and 'g' if the probability of
having 'ug' divided by the probability of having 'u' then 'g' is greater than for any other pair of symbols. It's
subtly different from what BPE does in the sense that it evaluates what it "loses" by merging two symbols and makes
sure it's `worth it`.

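Concretely, on the symbol and pair counts of the toy corpus from the BPE section, this criterion can be written as
the score ``freq(pair) / (freq(first) * freq(second))``. The sketch below only illustrates that criterion (it is not
the actual training code) and shows that WordPiece would pick a different first merge than BPE:

.. code-block::

    >>> # Symbol and pair frequencies of the toy corpus from the BPE section.
    >>> symbol_freqs = {'b': 4, 'g': 20, 'h': 15, 'n': 16, 'p': 17, 's': 5, 'u': 36}
    >>> pair_freqs = {('h', 'u'): 15, ('u', 'g'): 20, ('p', 'u'): 17, ('u', 'n'): 16, ('b', 'u'): 4, ('g', 's'): 5}

    >>> def wordpiece_score(pair):
    ...     # High when the pair occurs together more often than its parts alone would suggest.
    ...     return pair_freqs[pair] / (symbol_freqs[pair[0]] * symbol_freqs[pair[1]])

    >>> max(pair_freqs, key=wordpiece_score)  # WordPiece-style choice
    ('g', 's')
    >>> max(pair_freqs, key=pair_freqs.get)  # BPE-style choice
    ('u', 'g')
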
.. _unigram:

Unigram
=======================================================================================================================

Unigram is a subword tokenization algorithm introduced in `this paper <https://arxiv.org/pdf/1804.10959.pdf>`__.
Instead of starting with a group of base symbols and learning merges with some rule, like BPE or WordPiece, it starts
from a large vocabulary (for instance, all pretokenized words and the most common substrings) that it will trim down
progressively. It's not used directly for any of the pretrained models in the library, but it's used in conjunction
with :ref:`SentencePiece <sentencepiece>`.

More specifically, at a given step, Unigram computes a loss from the corpus it has and the current vocabulary, then,
for each subword, evaluates how much the loss would increase if that subword were removed from the vocabulary. It
then sorts the subwords by this quantity (which represents how much worse the loss becomes when the token is removed)
and removes the worst p tokens (for instance, p could be 10% or 20% of the vocabulary). It then repeats the process
until the vocabulary has reached the desired size, always keeping the base characters (so that any word written with
them can still be tokenized, like in BPE or WordPiece).

Contrary to BPE and WordPiece, which work out rules in a certain order that you can then apply in the same order when
tokenizing new text, Unigram will have several ways of tokenizing a new text. For instance, if it ends up with the
vocabulary

.. code-block::

    ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']

we had before, it could tokenize "hugs" as ``['hug', 's']``, ``['h', 'ug', 's']`` or ``['h', 'u', 'g', 's']``. So
which one should it choose? On top of saving the vocabulary, the trained tokenizer saves the probability of each
token in the training corpus. You can then give a probability to each tokenization (the product of the probabilities
of the tokens forming it) and pick the most likely one (or, if you want to apply some data augmentation, you could
sample one of the tokenizations according to their probabilities).

Those probabilities define the loss that trains the tokenizer: if our corpus consists of the words
:math:`x_{1}, \dots, x_{N}` and if, for the word :math:`x_{i}`, we denote by :math:`S(x_{i})` the set of all possible
tokenizations of :math:`x_{i}` (with the current vocabulary), then the loss is defined as

.. math::

    \mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )

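To make the selection of the most likely tokenization concrete, here is a small sketch that enumerates every
tokenization of "hugs" over the toy vocabulary and scores it with some made-up token probabilities. The probabilities
are invented for the example; a real Unigram tokenizer learns them from the corpus and uses dynamic programming
instead of brute-force enumeration:

.. code-block::

    >>> import math

    >>> # Invented probabilities for the toy vocabulary; a real tokenizer learns these.
    >>> probs = {'b': 0.02, 'g': 0.03, 'h': 0.03, 'n': 0.04, 'p': 0.04, 's': 0.05,
    ...          'u': 0.06, 'ug': 0.15, 'un': 0.12, 'hug': 0.20}

    >>> def tokenizations(word):
    ...     # Recursively enumerate every way to split `word` into tokens of the vocabulary.
    ...     if not word:
    ...         return [[]]
    ...     results = []
    ...     for i in range(1, len(word) + 1):
    ...         if word[:i] in probs:
    ...             results += [[word[:i]] + rest for rest in tokenizations(word[i:])]
    ...     return results

    >>> candidates = tokenizations("hugs")
    >>> sorted(candidates, key=lambda toks: -math.prod(probs[t] for t in toks))[0]
    ['hug', 's']
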
.. _sentencepiece:

SentencePiece
=======================================================================================================================

All the tokenization methods we have looked at so far require some form of pretokenization, which has a central
problem: not all languages use spaces to separate words. :doc:`XLM <model_doc/xlm>` deals with this by using specific
pretokenizers for each of those languages (in this case, Chinese, Japanese and Thai). To solve the problem more
generally, SentencePiece (introduced in `this paper <https://arxiv.org/pdf/1808.06226.pdf>`__) treats the input as a
raw stream, includes the space in the set of characters to use, then uses BPE or unigram to construct the appropriate
vocabulary.

That's why in the example we saw before using :class:`~transformers.XLNetTokenizer` (which uses SentencePiece), we
had the '▁' character, which represents a space. Decoding a tokenized text is then very easy: we just have to
concatenate all the tokens together and replace '▁' with a space.

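For instance, on the XLNet tokens shown earlier, this decoding rule boils down to a join and a replace (a minimal
sketch of the rule just described; in practice the tokenizer's own ``convert_tokens_to_string`` and ``decode``
methods handle this for you):

.. code-block::

    >>> tokens = ['▁Don', "'", 't', '▁you', '▁love', '▁', '🤗', '▁', 'Transform', 'ers', '?', '▁We', '▁sure', '▁do', '.']
    >>> "".join(tokens).replace('▁', ' ').strip()
    "Don't you love 🤗 Transformers? We sure do."
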
All transformers models in the library that use SentencePiece use it in combination with unigram. Examples of models
using SentencePiece are :doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>` or the
:doc:`Marian framework <model_doc/marian>`.

Training and fine-tuning
=======================================================================================================================

Model classes in 🤗 Transformers are designed to be compatible with native
PyTorch and TensorFlow 2 and can be used seamlessly with either. In this
@@ -24,7 +24,7 @@ Sections:
.. _pytorch:

Fine-tuning in native PyTorch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Model classes in 🤗 Transformers that don't begin with ``TF`` are
`PyTorch Modules <https://pytorch.org/docs/master/generated/torch.nn.Module.html>`_,
@@ -141,7 +141,7 @@ with features like mixed precision and easy tensorboard logging.
Freezing the encoder
-----------------------------------------------------------------------------------------------------------------------

In some cases, you might be interested in keeping the weights of the
pre-trained encoder frozen and optimizing only the weights of the head
@@ -158,7 +158,7 @@ submodule on any task-specific model in the library:
.. _tensorflow:

Fine-tuning in native TensorFlow 2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Models can also be trained natively in TensorFlow 2. Just as with PyTorch,
TensorFlow models can be instantiated with
@@ -210,7 +210,7 @@ can even save the model and then reload it as a PyTorch model (or vice-versa):
.. _trainer:

Trainer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We also provide a simple but feature-complete training and evaluation
interface through :func:`~transformers.Trainer` and
@@ -303,7 +303,7 @@ launching tensorboard in your specified ``logging_dir`` directory.
.. _additional-resources:

Additional resources
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- `A lightweight colab demo <https://colab.research.google.com/drive/1-JIJlao4dI-Ilww_NnTc0rxtp-ymgDgM?usp=sharing>`_
  which uses ``Trainer`` for IMDb sentiment classification.
@@ -32,54 +32,55 @@ ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class AlbertConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a :class:`~transformers.AlbertModel` or a
    :class:`~transformers.TFAlbertModel`. It is used to instantiate an ALBERT model according to the specified
    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
    configuration to that of the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
    outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.

    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 30000):
            Vocabulary size of the ALBERT model. Defines the number of different tokens that can be represented by
            the :obj:`inputs_ids` passed when calling :class:`~transformers.AlbertModel` or
            :class:`~transformers.TFAlbertModel`.
        embedding_size (:obj:`int`, `optional`, defaults to 128):
            Dimensionality of vocabulary embeddings.
        hidden_size (:obj:`int`, `optional`, defaults to 4096):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_hidden_groups (:obj:`int`, `optional`, defaults to 1):
            Number of groups for the hidden layers, parameters in the same group are shared.
        num_attention_heads (:obj:`int`, `optional`, defaults to 64):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (:obj:`int`, `optional`, defaults to 16384):
            The dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
        inner_group_num (:obj:`int`, `optional`, defaults to 1):
            The number of inner repetition of attention and ffn.
        hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu_new"`):
            The non-linear activation function (function or string) in the encoder and pooler.
            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
        hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
            The maximum sequence length that this model might ever be used with. Typically set this to something
            large (e.g., 512 or 1024 or 2048).
        type_vocab_size (:obj:`int`, `optional`, defaults to 2):
            The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.AlbertModel`
            or :class:`~transformers.TFAlbertModel`.
        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        classifier_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
            The dropout ratio for attached classifiers.

    Examples::

        >>> from transformers import AlbertConfig, AlbertModel
        >>> # Initializing an ALBERT-xxlarge style configuration
@@ -50,10 +50,10 @@ BERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class BertConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a :class:`~transformers.BertModel` or a
    :class:`~transformers.TFBertModel`. It is used to instantiate a BERT model according to the specified
    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
    configuration to that of the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
    to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
@@ -61,37 +61,39 @@ class BertConfig(PretrainedConfig):
    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 30522):
            Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by
            the :obj:`inputs_ids` passed when calling :class:`~transformers.BertModel` or
            :class:`~transformers.TFBertModel`.
        hidden_size (:obj:`int`, `optional`, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (:obj:`int`, `optional`, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (:obj:`int`, `optional`, defaults to 3072):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
        hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler.
            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
        hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
            The maximum sequence length that this model might ever be used with.
            Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
        type_vocab_size (:obj:`int`, `optional`, defaults to 2):
            The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.BertModel` or
            :class:`~transformers.TFBertModel`.
        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
            If :obj:`True`, use gradient checkpointing to save memory at the expense of slower backward pass.

    Examples::

        >>> from transformers import BertModel, BertConfig
@@ -19,18 +19,18 @@ from .configuration_utils import PretrainedConfig
class BertGenerationConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a
    :class:`~transformers.BertGenerationPreTrainedModel`. It is used to instantiate a BertGeneration model according
    to the specified arguments, defining the model architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
    to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
    for more information.

    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 50358):
            Vocabulary size of the BertGeneration model. Defines the number of different tokens that can be
            represented by the :obj:`inputs_ids` passed when calling :class:`~transformers.BertGeneration`.
        hidden_size (:obj:`int`, `optional`, defaults to 1024):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (:obj:`int`, `optional`, defaults to 24):
@@ -38,7 +38,7 @@ class BertGenerationConfig(PretrainedConfig):
        num_attention_heads (:obj:`int`, `optional`, defaults to 16):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (:obj:`int`, `optional`, defaults to 3072):
            Dimensionality of the "intermediate" (often called feed-forward) layer in the Transformer encoder.
        hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler.
            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
@@ -56,7 +56,7 @@ class BertGenerationConfig(PretrainedConfig):
        gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
            If :obj:`True`, use gradient checkpointing to save memory at the expense of slower backward pass.

    Examples::

        >>> from transformers import BertGenerationConfig, BertGenerationEncoder
@@ -25,44 +25,45 @@ CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP = {"ctrl": "https://s3.amazonaws.com/models.h
class CTRLConfig(PretrainedConfig):
    """
    This is the configuration class to store the configuration of a :class:`~transformers.CTRLModel` or a
    :class:`~transformers.TFCTRLModel`. It is used to instantiate a CTRL model according to the specified
    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
    configuration to that of the `ctrl <https://huggingface.co/ctrl>`__ architecture from SalesForce.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
    outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.

    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 246534):
            Vocabulary size of the CTRL model. Defines the number of different tokens that can be represented by
            the :obj:`inputs_ids` passed when calling :class:`~transformers.CTRLModel` or
            :class:`~transformers.TFCTRLModel`.
        n_positions (:obj:`int`, `optional`, defaults to 256):
            The maximum sequence length that this model might ever be used with.
            Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
        n_ctx (:obj:`int`, `optional`, defaults to 256):
            Dimensionality of the causal mask (usually same as n_positions).
        n_embd (:obj:`int`, `optional`, defaults to 1280):
            Dimensionality of the embeddings and hidden states.
        dff (:obj:`int`, `optional`, defaults to 8192):
            Dimensionality of the inner dimension of the feed forward networks (FFN).
        n_layer (:obj:`int`, `optional`, defaults to 48):
            Number of hidden layers in the Transformer encoder.
        n_head (:obj:`int`, `optional`, defaults to 16):
            Number of attention heads for each attention layer in the Transformer encoder.
        resid_pdrop (:obj:`float`, `optional`, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        embd_pdrop (:obj:`int`, `optional`, defaults to 0.1):
            The dropout ratio for the embeddings.
        attn_pdrop (:obj:`float`, `optional`, defaults to 0.1):
            The dropout ratio for the attention.
        layer_norm_epsilon (:obj:`float`, `optional`, defaults to 1e-6):
            The epsilon to use in the layer normalization layers.
        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

    Examples::

        >>> from transformers import CTRLModel, CTRLConfig
@@ -33,50 +33,51 @@ DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class DistilBertConfig(PretrainedConfig): class DistilBertConfig(PretrainedConfig):
r""" r"""
This is the configuration class to store the configuration of a :class:`~transformers.DistilBertModel`. This is the configuration class to store the configuration of a :class:`~transformers.DistilBertModel` or a
It is used to instantiate a DistilBERT model according to the specified arguments, defining the model :class:`~transformers.TFDistilBertModel`. It is used to instantiate a DistilBERT model according to the specified
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
the DistilBERT `distilbert-base-uncased <https://huggingface.co/distilbert-base-uncased>`__ architecture. configuration to that of the DistilBERT
`distilbert-base-uncased <https://huggingface.co/distilbert-base-uncased>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information. for more information.
Args: Args:
vocab_size (:obj:`int`, optional, defaults to 30522): vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the DistilBERT model. Defines the different tokens that Vocabulary size of the DistilBERT model. Defines the number of different tokens that can be represented by the
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertModel`. :obj:`inputs_ids` passed when calling :class:`~transformers.DistilBertModel` or
max_position_embeddings (:obj:`int`, optional, defaults to 512): :class:`~transformers.TFDistilBertModel`.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
sinusoidal_pos_embds (:obj:`boolean`, optional, defaults to :obj:`False`): sinusoidal_pos_embds (:obj:`boolean`, `optional`, defaults to :obj:`False`):
Whether to use sinusoidal positional embeddings. Whether to use sinusoidal positional embeddings.
n_layers (:obj:`int`, optional, defaults to 6): n_layers (:obj:`int`, `optional`, defaults to 6):
Number of hidden layers in the Transformer encoder. Number of hidden layers in the Transformer encoder.
n_heads (:obj:`int`, optional, defaults to 12): n_heads (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder. Number of attention heads for each attention layer in the Transformer encoder.
dim (:obj:`int`, optional, defaults to 768): dim (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the encoder layers and the pooler layer. Dimensionality of the encoder layers and the pooler layer.
hidden_dim (:obj:`int`, optional, defaults to 3072): hidden_dim (:obj:`int`, `optional`, defaults to 3072):
The size of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. The size of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
dropout (:obj:`float`, optional, defaults to 0.1): dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, optional, defaults to 0.1): attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities. The dropout ratio for the attention probabilities.
activation (:obj:`str` or :obj:`function`, optional, defaults to "gelu"): activation (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported. If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
initializer_range (:obj:`float`, optional, defaults to 0.02): initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
qa_dropout (:obj:`float`, optional, defaults to 0.1): qa_dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability used in the question answering model The dropout probability used in the question answering model
:class:`~transformers.DistilBertForQuestionAnswering`. :class:`~transformers.DistilBertForQuestionAnswering`.
seq_classif_dropout (:obj:`float`, optional, defaults to 0.2): seq_classif_dropout (:obj:`float`, `optional`, defaults to 0.2):
The dropout probability used in the sequence classification and the multiple choice model The dropout probability used in the sequence classification and the multiple choice model
:class:`~transformers.DistilBertForSequenceClassification`. :class:`~transformers.DistilBertForSequenceClassification`.
Example:: Examples::
>>> from transformers import DistilBertModel, DistilBertConfig >>> from transformers import DistilBertModel, DistilBertConfig
......
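A minimal sketch of the same pattern for the DistilBERT configuration above (the defaults approximate the documented distilbert-base-uncased architecture)::

    >>> from transformers import DistilBertConfig, DistilBertModel
    >>> # Configuration with the documented defaults (6 layers, 12 heads, dim=768, ...)
    >>> configuration = DistilBertConfig()
    >>> # Model with randomly initialized weights built from that configuration
    >>> model = DistilBertModel(configuration)
    >>> configuration = model.config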
...@@ -32,8 +32,12 @@ class DPRConfig(PretrainedConfig): ...@@ -32,8 +32,12 @@ class DPRConfig(PretrainedConfig):
:class:`~transformers.DPRConfig` is the configuration class to store the configuration of a :class:`~transformers.DPRConfig` is the configuration class to store the configuration of a
`DPRModel`. `DPRModel`.
This is the configuration class to store the configuration of a `DPRContextEncoder`, `DPRQuestionEncoder`, or a `DPRReader`. This is the configuration class to store the configuration of a :class:`~transformers.DPRContextEncoder`,
It is used to instantiate the components of the DPR model. :class:`~transformers.DPRQuestionEncoder`, or a :class:`~transformers.DPRReader`. It is used to instantiate the
components of the DPR model.
This class is a subclass of :class:`~transformers.BertConfig`. Please check the
superclass for the documentation of all kwargs.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 30522): vocab_size (:obj:`int`, `optional`, defaults to 30522):
......
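Since the DPR configuration is shared by three components, a minimal sketch of feeding it to one of them (``DPRContextEncoder`` is picked here purely for illustration)::

    >>> from transformers import DPRConfig, DPRContextEncoder
    >>> # DPRConfig subclasses BertConfig, so the usual BERT hyper-parameters are accepted
    >>> configuration = DPRConfig()
    >>> # The same configuration object can initialize any of the three DPR components
    >>> context_encoder = DPRContextEncoder(configuration)
    >>> configuration = context_encoder.config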
...@@ -33,11 +33,11 @@ ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP = { ...@@ -33,11 +33,11 @@ ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class ElectraConfig(PretrainedConfig): class ElectraConfig(PretrainedConfig):
r""" r"""
This is the configuration class to store the configuration of a :class:`~transformers.ElectraModel`. This is the configuration class to store the configuration of a :class:`~transformers.ElectraModel` or a
It is used to instantiate an ELECTRA model according to the specified :class:`~transformers.TFElectraModel`. It is used to instantiate an ELECTRA model according to the specified
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
the ELECTRA `google/electra-small-discriminator <https://huggingface.co/google/electra-small-discriminator>`__ configuration to that of the ELECTRA
architecture. `google/electra-small-discriminator <https://huggingface.co/google/electra-small-discriminator>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
...@@ -45,59 +45,61 @@ class ElectraConfig(PretrainedConfig): ...@@ -45,59 +45,61 @@ class ElectraConfig(PretrainedConfig):
Args: Args:
vocab_size (:obj:`int`, optional, defaults to 30522): vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the ELECTRA model. Defines the different tokens that Vocabulary size of the ELECTRA model. Defines the number of different tokens that can be represented by the
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.ElectraModel`. :obj:`inputs_ids` passed when calling :class:`~transformers.ElectraModel` or
embedding_size (:obj:`int`, optional, defaults to 128): :class:`~transformers.TFElectraModel`.
embedding_size (:obj:`int`, `optional`, defaults to 128):
Dimensionality of the encoder layers and the pooler layer. Dimensionality of the encoder layers and the pooler layer.
hidden_size (:obj:`int`, optional, defaults to 256): hidden_size (:obj:`int`, `optional`, defaults to 256):
Dimensionality of the encoder layers and the pooler layer. Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, optional, defaults to 12): num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder. Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, optional, defaults to 4): num_attention_heads (:obj:`int`, `optional`, defaults to 4):
Number of attention heads for each attention layer in the Transformer encoder. Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, optional, defaults to 1024): intermediate_size (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu"): hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported. If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1): hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1): attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities. The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, optional, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, optional, defaults to 2): type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.ElectraModel`. The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.ElectraModel` or
initializer_range (:obj:`float`, optional, defaults to 0.02): :class:`~transformers.TFElectraModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12): layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers. The epsilon used by the layer normalization layers.
summary_type (:obj:`string`, optional, defaults to "first"): summary_type (:obj:`str`, `optional`, defaults to :obj:`"first"`):
Argument used when doing sequence summary. Used in for the multiple choice head in Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
:class:`~transformers.ElectraForMultipleChoice`.
Is one of the following options: Has to be one of the following options:
- 'last' => take the last token hidden state (like XLNet) - :obj:`"last"`: Take the last token hidden state (like XLNet).
- 'first' => take the first token hidden state (like Bert) - :obj:`"first"`: Take the first token hidden state (like BERT).
- 'mean' => take the mean of all tokens hidden states - :obj:`"mean"`: Take the mean of all tokens hidden states.
- 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2) - :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- 'attn' => Not implemented now, use multi-head attention - :obj:`"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`): summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
:class:`~transformers.ElectraForMultipleChoice`.
Add a projection after the vector extraction Whether or not to add a projection after the vector extraction.
summary_activation (:obj:`string` or :obj:`None`, optional): summary_activation (:obj:`str`, `optional`):
Argument used when doing sequence summary. Used in for the multiple choice head in Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
:class:`~transformers.ElectraForMultipleChoice`.
'gelu' => add a gelu activation to the output, Other => no activation. Pass :obj:`"gelu"` for a gelu activation to the output, any other value will result in no activation.
summary_last_dropout (:obj:`float`, optional, defaults to 0.0): summary_last_dropout (:obj:`float`, `optional`, defaults to 0.0):
Argument used when doing sequence summary. Used in for the multiple choice head in Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
:class:`~transformers.ElectraForMultipleChoice`.
Add a dropout after the projection and activation The dropout ratio to be used after the projection and activation.
Example:: Examples::
>>> from transformers import ElectraModel, ElectraConfig >>> from transformers import ElectraModel, ElectraConfig
......
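A minimal sketch for the ELECTRA configuration above; per the docstring, the defaults are close to the google/electra-small-discriminator architecture::

    >>> from transformers import ElectraConfig, ElectraModel
    >>> # Defaults: hidden_size=256, 12 layers, 4 attention heads, embedding_size=128
    >>> configuration = ElectraConfig()
    >>> model = ElectraModel(configuration)
    >>> configuration = model.config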
...@@ -25,22 +25,24 @@ logger = logging.get_logger(__name__) ...@@ -25,22 +25,24 @@ logger = logging.get_logger(__name__)
class EncoderDecoderConfig(PretrainedConfig): class EncoderDecoderConfig(PretrainedConfig):
r""" r"""
:class:`~transformers.EncoderDecoderConfig` is the configuration class to store the configuration of a `EncoderDecoderModel`. :class:`~transformers.EncoderDecoderConfig` is the configuration class to store the configuration of a
:class:`~transformers.EncoderDecoderModel`. It is used to instantiate an Encoder Decoder model according to the
specified arguments, defining the encoder and decoder configs.
It is used to instantiate an Encoder Decoder model according to the specified arguments, defining the encoder and decoder configs. Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
Configuration objects inherit from :class:`~transformers.PretrainedConfig` to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
and can be used to control the model outputs. for more information.
See the documentation for :class:`~transformers.PretrainedConfig` for more information.
Args: Args:
kwargs (`optional`): kwargs (`optional`):
Remaining dictionary of keyword arguments. Notably: Dictionary of keyword arguments. Notably:
encoder (:class:`PretrainedConfig`, optional, defaults to `None`):
An instance of a configuration object that defines the encoder config.
decoder (:class:`PretrainedConfig`, optional, defaults to `None`):
An instance of a configuration object that defines the decoder config.
Example:: - **encoder** (:class:`~transformers.PretrainedConfig`, `optional`) -- An instance of a configuration
object that defines the encoder config.
- **decoder** (:class:`~transformers.PretrainedConfig`, `optional`) -- An instance of a configuration
object that defines the decoder config.
Examples::
>>> from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel >>> from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel
......
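Because the encoder and decoder sub-configurations travel through ``kwargs``, a minimal sketch of assembling them explicitly (BERT configurations are used on both sides only for illustration)::

    >>> from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel
    >>> # One configuration per side of the model
    >>> config_encoder = BertConfig()
    >>> config_decoder = BertConfig()
    >>> # Combine them into a single EncoderDecoderConfig
    >>> config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
    >>> # Instantiate an (untrained) encoder-decoder model from the combined configuration
    >>> model = EncoderDecoderModel(config=config)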
...@@ -30,11 +30,9 @@ FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = { ...@@ -30,11 +30,9 @@ FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class FlaubertConfig(XLMConfig): class FlaubertConfig(XLMConfig):
""" """
Configuration class to store the configuration of a `FlaubertModel`. This is the configuration class to store the configuration of a :class:`~transformers.FlaubertModel` or a
This is the configuration class to store the configuration of a :class:`~transformers.XLMModel`. :class:`~transformers.TFFlaubertModel`. It is used to instantiate a FlauBERT model according to the specified
It is used to instantiate an XLM model according to the specified arguments, defining the model arguments, defining the model architecture.
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the `xlm-mlm-en-2048 <https://huggingface.co/xlm-mlm-en-2048>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
...@@ -47,95 +45,95 @@ class FlaubertConfig(XLMConfig): ...@@ -47,95 +45,95 @@ class FlaubertConfig(XLMConfig):
layerdrop (:obj:`float`, `optional`, defaults to 0.0): layerdrop (:obj:`float`, `optional`, defaults to 0.0):
Probability to drop layers during training (Fan et al., Reducing Transformer Depth on Demand Probability to drop layers during training (Fan et al., Reducing Transformer Depth on Demand
with Structured Dropout. ICLR 2020) with Structured Dropout. ICLR 2020)
vocab_size (:obj:`int`, optional, defaults to 30145): vocab_size (:obj:`int`, `optional`, defaults to 30145):
Vocabulary size of the Flaubert model. Defines the different tokens that Vocabulary size of the FlauBERT model. Defines the number of different tokens that can be represented by the
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.FlaubertModel`. :obj:`inputs_ids` passed when calling :class:`~transformers.FlaubertModel` or
emb_dim (:obj:`int`, optional, defaults to 2048): :class:`~transformers.TFFlaubertModel`.
emb_dim (:obj:`int`, `optional`, defaults to 2048):
Dimensionality of the encoder layers and the pooler layer. Dimensionality of the encoder layers and the pooler layer.
n_layer (:obj:`int`, optional, defaults to 12): n_layer (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder. Number of hidden layers in the Transformer encoder.
n_head (:obj:`int`, optional, defaults to 16): n_head (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder. Number of attention heads for each attention layer in the Transformer encoder.
dropout (:obj:`float`, optional, defaults to 0.1): dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected The dropout probability for all fully connected
layers in the embeddings, encoder, and pooler. layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, optional, defaults to 0.1): attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for the attention mechanism The dropout probability for the attention mechanism
gelu_activation (:obj:`boolean`, optional, defaults to :obj:`True`): gelu_activation (:obj:`bool`, `optional`, defaults to :obj:`True`):
The non-linear activation function (function or string) in the Whether or not to use a `gelu` activation instead of `relu`.
encoder and pooler. If set to `True`, "gelu" will be used instead of "relu". sinusoidal_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
sinusoidal_embeddings (:obj:`boolean`, optional, defaults to :obj:`False`): Whether or not to use sinusoidal positional embeddings instead of absolute positional embeddings.
Whether to use sinusoidal positional embeddings instead of absolute positional embeddings. causal (:obj:`bool`, `optional`, defaults to :obj:`False`):
causal (:obj:`boolean`, optional, defaults to :obj:`False`): Whether or not the model should behave in a causal manner.
Set this to `True` for the model to behave in a causal manner.
Causal models use a triangular attention mask in order to only attend to the left-side context instead Causal models use a triangular attention mask in order to only attend to the left-side context instead
of a bidirectional context. of a bidirectional context.
asm (:obj:`boolean`, optional, defaults to :obj:`False`): asm (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to use an adaptive log softmax projection layer instead of a linear layer for the prediction Whether or not to use an adaptive log softmax projection layer instead of a linear layer for the prediction
layer. layer.
n_langs (:obj:`int`, optional, defaults to 1): n_langs (:obj:`int`, `optional`, defaults to 1):
The number of languages the model handles. Set to 1 for monolingual models. The number of languages the model handles. Set to 1 for monolingual models.
use_lang_emb (:obj:`boolean`, optional, defaults to :obj:`True`) use_lang_emb (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to use language embeddings. Some models use additional language embeddings, see Whether to use language embeddings. Some models use additional language embeddings, see
`the multilingual models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__ `the multilingual models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__
for information on how to use them. for information on how to use them.
max_position_embeddings (:obj:`int`, optional, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might The maximum sequence length that this model might
ever be used with. Typically set this to something large just in case ever be used with. Typically set this to something large just in case
(e.g., 512 or 1024 or 2048). (e.g., 512 or 1024 or 2048).
embed_init_std (:obj:`float`, optional, defaults to 2048^-0.5): embed_init_std (:obj:`float`, `optional`, defaults to 2048^-0.5):
The standard deviation of the truncated_normal_initializer for The standard deviation of the truncated_normal_initializer for
initializing the embedding matrices. initializing the embedding matrices.
init_std (:obj:`int`, optional, defaults to 50257): init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for The standard deviation of the truncated_normal_initializer for
initializing all weight matrices except the embedding matrices. initializing all weight matrices except the embedding matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12): layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers. The epsilon used by the layer normalization layers.
bos_index (:obj:`int`, optional, defaults to 0): bos_index (:obj:`int`, `optional`, defaults to 0):
The index of the beginning of sentence token in the vocabulary. The index of the beginning of sentence token in the vocabulary.
eos_index (:obj:`int`, optional, defaults to 1): eos_index (:obj:`int`, `optional`, defaults to 1):
The index of the end of sentence token in the vocabulary. The index of the end of sentence token in the vocabulary.
pad_index (:obj:`int`, optional, defaults to 2): pad_index (:obj:`int`, `optional`, defaults to 2):
The index of the padding token in the vocabulary. The index of the padding token in the vocabulary.
unk_index (:obj:`int`, optional, defaults to 3): unk_index (:obj:`int`, `optional`, defaults to 3):
The index of the unknown token in the vocabulary. The index of the unknown token in the vocabulary.
mask_index (:obj:`int`, optional, defaults to 5): mask_index (:obj:`int`, `optional`, defaults to 5):
The index of the masking token in the vocabulary. The index of the masking token in the vocabulary.
is_encoder(:obj:`boolean`, optional, defaults to :obj:`True`): is_encoder (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether the initialized model should be a transformer encoder or decoder as seen in Vaswani et al. Whether or not the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.
summary_type (:obj:`string`, optional, defaults to "first"): summary_type (:obj:`string`, `optional`, defaults to "first"):
Argument used when doing sequence summary. Used in for the multiple choice head in Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
:class:`~transformers.XLMForSequenceClassification`.
Is one of the following options: Has to be one of the following options:
- 'last' => take the last token hidden state (like XLNet) - :obj:`"last"`: Take the last token hidden state (like XLNet).
- 'first' => take the first token hidden state (like Bert) - :obj:`"first"`: Take the first token hidden state (like BERT).
- 'mean' => take the mean of all tokens hidden states - :obj:`"mean"`: Take the mean of all tokens hidden states.
- 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2) - :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- 'attn' => Not implemented now, use multi-head attention - :obj:`"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`): summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
:class:`~transformers.XLMForSequenceClassification`.
Add a projection after the vector extraction Whether or not to add a projection after the vector extraction.
summary_activation (:obj:`string` or :obj:`None`, optional): summary_activation (:obj:`str`, `optional`):
Argument used when doing sequence summary. Used in for the multiple choice head in Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
:class:`~transformers.XLMForSequenceClassification`.
'tanh' => add a tanh activation to the output, Other => no activation. Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`): summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in Used in the sequence classification and multiple choice models.
:class:`~transformers.XLMForSequenceClassification`.
If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False. Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes.
summary_first_dropout (:obj:`float`, optional, defaults to 0.1): summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1):
Argument used when doing sequence summary. Used in for the multiple choice head in Used in the sequence classification and multiple choice models.
:class:`~transformers.XLMForSequenceClassification`.
Add a dropout before the projection and activation The dropout ratio to be used before the projection and activation.
start_n_top (:obj:`int`, optional, defaults to 5): start_n_top (:obj:`int`, `optional`, defaults to 5):
Used in the SQuAD evaluation script for XLM and XLNet. Used in the SQuAD evaluation script.
end_n_top (:obj:`int`, optional, defaults to 5): end_n_top (:obj:`int`, `optional`, defaults to 5):
Used in the SQuAD evaluation script for XLM and XLNet. Used in the SQuAD evaluation script.
mask_token_id (:obj:`int`, optional, defaults to 0): mask_token_id (:obj:`int`, `optional`, defaults to 0):
Model agnostic parameter to identify masked tokens when generating text in an MLM context. Model agnostic parameter to identify masked tokens when generating text in an MLM context.
lang_id (:obj:`int`, optional, defaults to 1): lang_id (:obj:`int`, `optional`, defaults to 1):
The ID of the language used by the model. This parameter is used when generating The ID of the language used by the model. This parameter is used when generating
text in a given language. text in a given language.
""" """
......
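A minimal sketch for the FlauBERT configuration above, relying only on the documented defaults (vocab_size=30145, emb_dim=2048, 12 layers, 16 heads)::

    >>> from transformers import FlaubertConfig, FlaubertModel
    >>> configuration = FlaubertConfig()
    >>> # Model with randomly initialized weights built from that configuration
    >>> model = FlaubertModel(configuration)
    >>> configuration = model.config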
...@@ -18,7 +18,6 @@ ...@@ -18,7 +18,6 @@
import copy import copy
from .configuration_utils import PretrainedConfig from .configuration_utils import PretrainedConfig
from .file_utils import add_start_docstrings_to_callable
from .utils import logging from .utils import logging
...@@ -27,33 +26,54 @@ logger = logging.get_logger(__name__) ...@@ -27,33 +26,54 @@ logger = logging.get_logger(__name__)
FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP = {} FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
FSMT_CONFIG_ARGS_DOC = r""" class DecoderConfig(PretrainedConfig):
r"""
Configuration class for FSMT's decoder specific things.
note: this is a private helper class
"""
model_type = "fsmt_decoder"
def __init__(self, vocab_size=0, bos_token_id=0):
super().__init__()
self.vocab_size = vocab_size
self.bos_token_id = bos_token_id
class FSMTConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.FSMTModel`. It is used to
instantiate an FSMT model according to the specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args: Args:
langs (:obj:`List[str]`): langs (:obj:`List[str]`):
source language, target_language (e.g. ['en', 'ru']) A list with source language and target language (e.g., ['en', 'ru']).
src_vocab_size (:obj:`int`): src_vocab_size (:obj:`int`):
defines the different tokens that can be represented by `inputs_ids` passed to the forward Vocabulary size of the encoder. Defines the number of different tokens that can be represented by the
method in the encoder. :obj:`inputs_ids` passed to the forward method in the encoder.
tgt_vocab_size (:obj:`int`): tgt_vocab_size (:obj:`int`):
defines the different tokens that can be represented by `inputs_ids` passed to the forward Vocabulary size of the decoder. Defines the number of different tokens that can be represented by the
method in the decoder. :obj:`inputs_ids` passed to the forward method in the decoder.
d_model (:obj:`int`, `optional`, defaults to 1024): d_model (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the layers and the pooler layer. Dimensionality of the layers and the pooler layer.
encoder_layers (:obj:`int`, `optional`, defaults to 12): encoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of encoder layers, 16 for pegasus, 6 for bart-base and marian Number of encoder layers.
decoder_layers (:obj:`int`, `optional`, defaults to 12): decoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of decoder layers, 16 for pegasus, 6 for bart-base and marian Number of decoder layers.
encoder_attention_heads (:obj:`int`, `optional`, defaults to 16): encoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder. Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (:obj:`int`, `optional`, defaults to 16): decoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer decoder. Number of attention heads for each attention layer in the Transformer decoder.
decoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096): decoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder. Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096): encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder. Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to "relu"): activation_function (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"relu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported. If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
dropout (:obj:`float`, `optional`, defaults to 0.1): dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, `optional`, defaults to 0.0): attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
...@@ -74,7 +94,7 @@ FSMT_CONFIG_ARGS_DOC = r""" ...@@ -74,7 +94,7 @@ FSMT_CONFIG_ARGS_DOC = r"""
eos_token_id (:obj:`int`, `optional`, defaults to 2) eos_token_id (:obj:`int`, `optional`, defaults to 2)
End of stream token id. End of stream token id.
decoder_start_token_id (:obj:`int`, `optional`): decoder_start_token_id (:obj:`int`, `optional`):
This model starts decoding with `eos_token_id` This model starts decoding with :obj:`eos_token_id`
encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
Google "layerdrop arxiv", as its not explainable in one line. Google "layerdrop arxiv", as its not explainable in one line.
decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
...@@ -92,26 +112,14 @@ FSMT_CONFIG_ARGS_DOC = r""" ...@@ -92,26 +112,14 @@ FSMT_CONFIG_ARGS_DOC = r"""
early_stopping (:obj:`bool`, `optional`, defaults to :obj:`False`) early_stopping (:obj:`bool`, `optional`, defaults to :obj:`False`)
Flag that will be used by default in the :obj:`generate` method of the model. Whether to stop Flag that will be used by default in the :obj:`generate` method of the model. Whether to stop
the beam search when at least ``num_beams`` sentences are finished per batch or not. the beam search when at least ``num_beams`` sentences are finished per batch or not.
"""
Examples::
class DecoderConfig(PretrainedConfig): >>> from transformers import FSMTConfig, FSMTModel
r"""
Configuration class for FSMT's decoder specific things.
note: this is a private helper class
"""
model_type = "fsmt_decoder"
def __init__(self, vocab_size=0, bos_token_id=0):
super().__init__()
self.vocab_size = vocab_size
self.bos_token_id = bos_token_id
>>> config = FSMTConfig.from_pretrained('facebook/wmt19-en-ru')
>>> model = FSMTModel(config)
@add_start_docstrings_to_callable(FSMT_CONFIG_ARGS_DOC)
class FSMTConfig(PretrainedConfig):
r"""
Configuration class for FSMT.
""" """
model_type = "fsmt" model_type = "fsmt"
...@@ -149,17 +157,6 @@ class FSMTConfig(PretrainedConfig): ...@@ -149,17 +157,6 @@ class FSMTConfig(PretrainedConfig):
early_stopping=False, early_stopping=False,
**common_kwargs **common_kwargs
): ):
r"""
:class:`~transformers.FSMTConfig` is the configuration class for `FSMTModel`.
Examples::
>>> from transformers import FSMTConfig, FSMTModel
>>> config = FSMTConfig.from_pretrained('facebook/wmt19-en-ru')
>>> model = FSMTModel(config)
"""
if "hidden_size" in common_kwargs: if "hidden_size" in common_kwargs:
raise ValueError("hidden size is called d_model") raise ValueError("hidden size is called d_model")
super().__init__( super().__init__(
......
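The constructor guard at the end of the hunk above means the usual ``hidden_size`` name is rejected in favor of ``d_model``. A minimal sketch of both sides of that check, assuming the remaining constructor defaults (not shown in this hunk) are sufficient::

    >>> from transformers import FSMTConfig
    >>> # d_model is the accepted name for the hidden dimension
    >>> config = FSMTConfig(d_model=512)
    >>> # Passing hidden_size instead raises the documented ValueError
    >>> try:
    ...     FSMTConfig(hidden_size=512)
    ... except ValueError as error:
    ...     print(error)
    hidden size is called d_model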
...@@ -36,20 +36,21 @@ FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP = { ...@@ -36,20 +36,21 @@ FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class FunnelConfig(PretrainedConfig): class FunnelConfig(PretrainedConfig):
r""" r"""
This is the configuration class to store the configuration of a :class:`~transformers.FunnelModel`. This is the configuration class to store the configuration of a :class:`~transformers.FunnelModel` or a
It is used to instantiate an Funnel Transformer model according to the specified arguments, defining the model :class:`~transformers.TFFunnelModel`. It is used to instantiate a Funnel Transformer model according to the specified
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
the Funnel Transformer `funnel-transformer/small <https://huggingface.co/funnel-transformer/small>`__ architecture. configuration to that of the Funnel Transformer `funnel-transformer/small
<https://huggingface.co/funnel-transformer/small>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information. for more information.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 30522): vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the Funnel transformer. Defines the different tokens that Vocabulary size of the Funnel transformer. Defines the number of different tokens that can be represented
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.FunnelModel`. by the :obj:`inputs_ids` passed when calling :class:`~transformers.FunnelModel` or
:class:`~transformers.TFFunnelModel`.
block_sizes (:obj:`List[int]`, `optional`, defaults to :obj:`[4, 4, 4]`): block_sizes (:obj:`List[int]`, `optional`, defaults to :obj:`[4, 4, 4]`):
The sizes of the blocks used in the model. The sizes of the blocks used in the model.
block_repeats (:obj:`List[int]`, `optional`): block_repeats (:obj:`List[int]`, `optional`):
...@@ -77,7 +78,8 @@ class FunnelConfig(PretrainedConfig): ...@@ -77,7 +78,8 @@ class FunnelConfig(PretrainedConfig):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, `optional`, defaults to 3): type_vocab_size (:obj:`int`, `optional`, defaults to 3):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.FunnelModel`. The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.FunnelModel` or
:class:`~transformers.TFFunnelModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.1): initializer_range (:obj:`float`, `optional`, defaults to 0.1):
The standard deviation of the `uniform initializer` for initializing all weight matrices in attention The standard deviation of the `uniform initializer` for initializing all weight matrices in attention
layers. layers.
......
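A minimal sketch for the Funnel Transformer configuration above, showing ``block_sizes`` (the values used are the documented defaults)::

    >>> from transformers import FunnelConfig, FunnelModel
    >>> # Three blocks of four layers each, as in the documented default
    >>> configuration = FunnelConfig(block_sizes=[4, 4, 4])
    >>> model = FunnelModel(configuration)
    >>> configuration = model.config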
...@@ -32,10 +32,10 @@ GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP = { ...@@ -32,10 +32,10 @@ GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class GPT2Config(PretrainedConfig): class GPT2Config(PretrainedConfig):
""" """
This is the configuration class to store the configuration of a :class:`~transformers.GPT2Model`. This is the configuration class to store the configuration of a :class:`~transformers.GPT2Model` or a
It is used to instantiate an GPT-2 model according to the specified arguments, defining the model :class:`~transformers.TFGPT2Model`. It is used to instantiate a GPT-2 model according to the specified
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
the GPT-2 `small <https://huggingface.co/gpt2>`__ architecture. configuration to that of the GPT-2 `small <https://huggingface.co/gpt2>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
...@@ -43,60 +43,66 @@ class GPT2Config(PretrainedConfig): ...@@ -43,60 +43,66 @@ class GPT2Config(PretrainedConfig):
Args: Args:
vocab_size (:obj:`int`, optional, defaults to 50257): vocab_size (:obj:`int`, `optional`, defaults to 50257):
Vocabulary size of the GPT-2 model. Defines the different tokens that Vocabulary size of the GPT-2 model. Defines the number of different tokens that can be represented by the
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.GPT2Model`. :obj:`inputs_ids` passed when calling :class:`~transformers.GPT2Model` or
n_positions (:obj:`int`, optional, defaults to 1024): :class:`~transformers.TFGPT2Model`.
n_positions (:obj:`int`, `optional`, defaults to 1024):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
n_ctx (:obj:`int`, optional, defaults to 1024): n_ctx (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the causal mask (usually same as n_positions). Dimensionality of the causal mask (usually same as n_positions).
n_embd (:obj:`int`, optional, defaults to 768): n_embd (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the embeddings and hidden states. Dimensionality of the embeddings and hidden states.
n_layer (:obj:`int`, optional, defaults to 12): n_layer (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder. Number of hidden layers in the Transformer encoder.
n_head (:obj:`int`, optional, defaults to 12): n_head (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder. Number of attention heads for each attention layer in the Transformer encoder.
n_inner (:obj:`int`, optional, defaults to None): n_inner (:obj:`int`, `optional`, defaults to None):
Dimensionality of the inner feed-forward layers. :obj:`None` will set it to 4 times n_embd Dimensionality of the inner feed-forward layers. :obj:`None` will set it to 4 times n_embd
activation_function (:obj:`str`, optional, defaults to 'gelu'): activation_function (:obj:`str`, `optional`, defaults to :obj:`"gelu"`):
Activation function selected in the list ["relu", "swish", "gelu", "tanh", "gelu_new"]. Activation function, to be selected in the list :obj:`["relu", "swish", "gelu", "tanh", "gelu_new"]`.
resid_pdrop (:obj:`float`, optional, defaults to 0.1): resid_pdrop (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
embd_pdrop (:obj:`int`, optional, defaults to 0.1): embd_pdrop (:obj:`int`, `optional`, defaults to 0.1):
The dropout ratio for the embeddings. The dropout ratio for the embeddings.
attn_pdrop (:obj:`float`, optional, defaults to 0.1): attn_pdrop (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention. The dropout ratio for the attention.
layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5): layer_norm_epsilon (:obj:`float`, `optional`, defaults to 1e-5):
The epsilon to use in the layer normalization layers The epsilon to use in the layer normalization layers
initializer_range (:obj:`float`, optional, defaults to 0.02): initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
summary_type (:obj:`string`, optional, defaults to "cls_index"): summary_type (:obj:`string`, `optional`, defaults to :obj:`"cls_index"`):
Argument used when doing sequence summary, used in the models
:class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
Has to be one of the following options:
- :obj:`"last"`: Take the last token hidden state (like XLNet).
- :obj:`"first"`: Take the first token hidden state (like BERT).
- :obj:`"mean"`: Take the mean of all tokens hidden states.
- :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- :obj:`"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary, used in the models
:class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
Whether or not to add a projection after the vector extraction.
summary_activation (:obj:`str`, `optional`):
Argument used when doing sequence summary. Used for the multiple choice head in Argument used when doing sequence summary. Used for the multiple choice head in
:class:`~transformers.GPT2DoubleHeadsModel`. :class:`~transformers.GPT2DoubleHeadsModel`.
Is one of the following options:
Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
- 'last' => take the last token hidden state (like XLNet) summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
- 'first' => take the first token hidden state (like Bert) Argument used when doing sequence summary, used in the models
- 'mean' => take the mean of all tokens hidden states :class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
- 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)
- 'attn' => Not implemented now, use multi-head attention Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes.
summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`): summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1):
Argument used when doing sequence summary. Used in for the multiple choice head in Argument used when doing sequence summary, used in the models
:class:`~transformers.GPT2DoubleHeadsModel`. :class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
Add a projection after the vector extraction
summary_activation (:obj:`string` or :obj:`None`, optional): The dropout ratio to be used before the projection and activation.
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.GPT2DoubleHeadsModel`.
'tanh' => add a tanh activation to the output, Other => no activation.
summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.GPT2DoubleHeadsModel`.
If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.
summary_first_dropout (:obj:`float`, optional, defaults to 0.1):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.GPT2DoubleHeadsModel`.
Add a dropout before the projection and activation
Example:: Example::
......
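The ``summary_*`` arguments above only come into play for the double-heads model. A minimal sketch of routing them through the configuration (the values are the documented defaults)::

    >>> from transformers import GPT2Config, GPT2DoubleHeadsModel
    >>> # Documented defaults for the sequence-summary head
    >>> configuration = GPT2Config(summary_type="cls_index", summary_use_proj=True, summary_first_dropout=0.1)
    >>> # The double-heads model is the one that consumes these settings
    >>> model = GPT2DoubleHeadsModel(configuration)
    >>> configuration = model.config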
...@@ -33,6 +33,10 @@ LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = { ...@@ -33,6 +33,10 @@ LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class LongformerConfig(RobertaConfig): class LongformerConfig(RobertaConfig):
r""" r"""
This is the configuration class to store the configuration of a :class:`~transformers.LongformerModel` or a
:class:`~transformers.TFLongformerModel`. It is used to instantiate a Longformer model according to the specified
arguments, defining the model architecture.
This is the configuration class to store the configuration of a :class:`~transformers.LongformerModel`. This is the configuration class to store the configuration of a :class:`~transformers.LongformerModel`.
It is used to instantiate a Longformer model according to the specified arguments, defining the model It is used to instantiate a Longformer model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
...@@ -42,8 +46,8 @@ class LongformerConfig(RobertaConfig): ...@@ -42,8 +46,8 @@ class LongformerConfig(RobertaConfig):
It reuses the same defaults. Please check the parent class for more information. It reuses the same defaults. Please check the parent class for more information.
Args: Args:
attention_window (:obj:`int` or :obj:`List[int]`, optional, defaults to 512): attention_window (:obj:`int` or :obj:`List[int]`, `optional`, defaults to 512):
Size of an attention window around each token. If :obj:`int`, use the same size for all layers. Size of an attention window around each token. If an :obj:`int`, use the same size for all layers.
To specify a different window size for each layer, use a :obj:`List[int]` where To specify a different window size for each layer, use a :obj:`List[int]` where
``len(attention_window) == num_hidden_layers``. ``len(attention_window) == num_hidden_layers``.
......
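Since ``attention_window`` accepts either one size for every layer or one size per layer, a minimal sketch of both forms (the per-layer list must contain ``num_hidden_layers`` entries)::

    >>> from transformers import LongformerConfig, LongformerModel
    >>> # A single window size shared by every layer
    >>> configuration = LongformerConfig(attention_window=512)
    >>> # Or one (even) window size per hidden layer
    >>> configuration = LongformerConfig(attention_window=[256] * configuration.num_hidden_layers)
    >>> model = LongformerModel(configuration)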
...@@ -29,83 +29,91 @@ LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP = { ...@@ -29,83 +29,91 @@ LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class LxmertConfig(PretrainedConfig): class LxmertConfig(PretrainedConfig):
r""" r"""
This is the configuration class to store the configuration of a :class:`~transformers.BertModel`. This is the configuration class to store the configuration of a :class:`~transformers.LxmertModel` or a
It is used to instantiate an Lxmert model according to the specified arguments, defining the model :class:`~transformers.TFLxmertModel`. It is used to instantiate a LXMERT model according to the specified
architecture. arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args: Args:
vocab_size (:obj:`int`, optional, defaults to 30522): vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the BERT model. Defines the different tokens that Vocabulary size of the LXMERT model. Defines the number of different tokens that can be represented by the
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertModel`. :obj:`inputs_ids` passed when calling :class:`~transformers.LxmertModel` or
hidden_size (:obj:`int`, optional, defaults to 768): :class:`~transformers.TFLxmertModel`.
hidden_size (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the encoder layers and the pooler layer. Dimensionality of the encoder layers and the pooler layer.
r_layers (:obj:`int`, optional, defaults to 5): r_layers (:obj:`int`, `optional`, defaults to 5):
Number of hidden layers in the Transformer visual encoder. Number of hidden layers in the Transformer visual encoder.
l_layers (:obj:`int`, optional, defaults to 9): l_layers (:obj:`int`, `optional`, defaults to 9):
Number of hidden layers in the Transformer language encoder. Number of hidden layers in the Transformer language encoder.
x_layers (:obj:`int`, optional, defaults to 5): x_layers (:obj:`int`, `optional`, defaults to 5):
Number of hidden layers in the Transformer cross modality encoder. Number of hidden layers in the Transformer cross modality encoder.
num_attention_heads (:obj:`int`, optional, defaults to 5): num_attention_heads (:obj:`int`, `optional`, defaults to 5):
Number of attention heads for each attention layer in the Transformer encoder. Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, optional, defaults to 3072): intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu"): hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported. If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1): hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1): attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities. The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, optional, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, optional, defaults to 2): type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.BertModel`. The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.LxmertModel` or :class:`~transformers.TFLxmertModel`.
initializer_range (:obj:`float`, optional, defaults to 0.02): initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12): layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers. The epsilon used by the layer normalization layers.
visual_feat_dim (:obj:`int`, optional, defaults to 2048): visual_feat_dim (:obj:`int`, `optional`, defaults to 2048):
This represents the last dimension of the pooled-object features used as input for the model, This represents the last dimension of the pooled-object features used as input for the model,
representing the size of each object feature itself. representing the size of each object feature itself.
visual_pos_dim (:obj:`int`, optional, defaults to 4): visual_pos_dim (:obj:`int`, `optional`, defaults to 4):
This represents the number of spatial features that are mixed into the visual features. This represents the number of spatial features that are mixed into the visual features.
The default is set to 4 because most commonly this will represent the location of a bounding box. The default is set to 4 because most commonly this will represent the location of a bounding box.
i.e. (x, y, width, height) i.e., (x, y, width, height)
visual_loss_normalizer (:obj:`float`, optional, defaults to 1/15): visual_loss_normalizer (:obj:`float`, `optional`, defaults to 1/15):
This represents the scaling factor by which each visual loss is multiplied if, during pretraining, This represents the scaling factor by which each visual loss is multiplied if, during pretraining,
one decided to train with multiple vision-based loss objectives. one decided to train with multiple vision-based loss objectives.
num_qa_labels (:obj:`int`, optional, defaults to 9500): num_qa_labels (:obj:`int`, `optional`, defaults to 9500):
This represents the total number of different question answering (QA) labels there are. If using more than one dataset with QA, This represents the total number of different question answering (QA) labels there are. If using more than
the user will need to account for the total number of labels that all of the datasets have in total. one dataset with QA, the user will need to account for the total number of labels that all of the datasets
num_object_labels (:obj:`int`, optional, defaults to 1600): have in total.
This represents the total number of semantically unique objects that lxmert will be able to classify a pooled-object feature num_object_labels (:obj:`int`, `optional`, defaults to 1600):
as belonging too. This represents the total number of semantically unique objects that lxmert will be able to classify a
num_attr_labels (:obj:`int`, optional, defaults to 400): pooled-object feature as belonging too.
This represents the total number of semantically unique attributes that lxmert will be able to classify a pooled-object feature num_attr_labels (:obj:`int`, `optional`, defaults to 400):
as possessing. This represents the total number of semantically unique attributes that lxmert will be able to classify a
task_matched (:obj:`bool`, optional, defaults to :obj:`True`): pooled-object feature as possessing.
This task is used for sentence-image matching. If the sentence correctly describes the image the label will be 1. task_matched (:obj:`bool`, `optional`, defaults to :obj:`True`):
If the sentence does not correctly describe the image, the label will be 0. This task is used for sentence-image matching. If the sentence correctly describes the image the label
task_mask_lm (:obj:`bool`, optional, defaults to :obj:`True`): will be 1. If the sentence does not correctly describe the image, the label will be 0.
This task is the defacto masked langauge modeling used in pretraining models such as BERT. task_mask_lm (:obj:`bool`, `optional`, defaults to :obj:`True`):
task_obj_predict (:obj:`bool`, optional, defaults to :obj:`True`): Whether or not to add masked language modeling (as used in pretraining models such as BERT) to the loss
This task is set to true if the user would like to perform one of the following loss objectives: objective.
object predicition, atrribute predicition, feature regression task_obj_predict (:obj:`bool`, `optional`, defaults to :obj:`True`):
task_qa (:obj:`bool`, optional, defaults to :obj:`True`): Whether or not to add object predicition, attribute predicition and feature regression to the loss
This task specifies whether or not Lxmert will calculate the question-asnwering loss objective objective.
visual_obj_loss (:obj:`bool`, optional, defaults to :obj:`True`): task_qa (:obj:`bool`, `optional`, defaults to :obj:`True`):
This task specifies whether or not Lxmert will calculate the object-prediction loss objective Whether or not to add the question-asnwering loss to the objective
visual_attr_loss (:obj:`bool`, optional, defaults to :obj:`True`): visual_obj_loss (:obj:`bool`, `optional`, defaults to :obj:`True`):
This task specifies whether or not Lxmert will calculate the attribute-prediction loss objective Whether or not to calculate the object-prediction loss objective
visual_feat_loss (:obj:`bool`, optional, defaults to :obj:`True`): visual_attr_loss (:obj:`bool`, `optional`, defaults to :obj:`True`):
This task specifies whether or not Lxmert will calculate the feature-regression loss objective Whether or not to calculate the attribute-prediction loss objective
output_attentions (:obj:`bool`, optional, defaults to :obj:`False`): visual_feat_loss (:obj:`bool`, `optional`, defaults to :obj:`True`):
if True, the vision, langauge, and cross-modality layers will be returned Whether or not to calculate the feature-regression loss objective
output_hidden_states (:obj:`bool`, optional, defaults to :obj:`False`): output_attentions (:obj:`bool`, `optional`, defaults to :obj:`False`):
if True, final cross-modality hidden states for language and vision features will be returned Whether or not the model should return the attentions from the vision, langauge, and cross-modality
layers should be returned.
output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the model should return the hidden states from the vision, langauge, and cross-modality
layers should be returned.
""" """
model_type = "lxmert" model_type = "lxmert"
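To make the vision-related sizes above concrete, here is a minimal sketch of building a configuration for detector-style inputs; the values simply restate the documented defaults, and the import names assume the LXMERT classes this docstring belongs to::

    from transformers import LxmertConfig, LxmertModel

    config = LxmertConfig(
        visual_feat_dim=2048,   # size of each pooled-object feature vector
        visual_pos_dim=4,       # bounding box per object: (x, y, width, height)
        num_qa_labels=9500,     # total number of QA labels across the QA datasets used
        task_obj_predict=True,  # object/attribute prediction and feature regression losses
        task_qa=True,           # question-answering loss
    )
    model = LxmertModel(config)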
...
...@@ -22,15 +22,16 @@ logger = logging.get_logger(__name__)
class MMBTConfig(object):
    """
    This is the configuration class to store the configuration of a :class:`~transformers.MMBTModel`. It is used to
    instantiate an MMBT model according to the specified arguments, defining the model architecture.

    Args:
        config (:class:`~transformers.PretrainedConfig`):
            Config of the underlying Transformer models. Its values are copied over to use a single config.
        num_labels (:obj:`int`, `optional`):
            Size of final Linear layer for classification.
        modal_hidden_size (:obj:`int`, `optional`, defaults to 2048):
            Embedding dimension of the non-text modality encoder.
    """
...
...@@ -25,9 +25,9 @@ MOBILEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class MobileBertConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a :class:`~transformers.MobileBertModel` or a
    :class:`~transformers.TFMobileBertModel`. It is used to instantiate a MobileBERT model according to the
    specified arguments, defining the model architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the
    model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
...@@ -35,54 +35,56 @@ class MobileBertConfig(PretrainedConfig):
    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 30522):
            Vocabulary size of the MobileBERT model. Defines the number of different tokens that can be
            represented by the :obj:`inputs_ids` passed when calling :class:`~transformers.MobileBertModel` or
            :class:`~transformers.TFMobileBertModel`.
        hidden_size (:obj:`int`, `optional`, defaults to 512):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (:obj:`int`, `optional`, defaults to 24):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (:obj:`int`, `optional`, defaults to 4):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (:obj:`int`, `optional`, defaults to 512):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
        hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"relu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string,
            :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
        hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.0):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
            The maximum sequence length that this model might ever be used with. Typically set this to
            something large just in case (e.g., 512 or 1024 or 2048).
        type_vocab_size (:obj:`int`, `optional`, defaults to 2):
            The vocabulary size of the :obj:`token_type_ids` passed when calling
            :class:`~transformers.MobileBertModel` or :class:`~transformers.TFMobileBertModel`.
        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        pad_token_id (:obj:`int`, `optional`, defaults to 0):
            The ID of the token in the word embedding to use as padding.
        embedding_size (:obj:`int`, `optional`, defaults to 128):
            The dimension of the word embedding vectors.
        trigram_input (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Use a convolution of trigram as input.
        use_bottleneck (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether to use bottleneck in BERT.
        intra_bottleneck_size (:obj:`int`, `optional`, defaults to 128):
            Size of bottleneck layer output.
        use_bottleneck_attention (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether to use attention inputs from the bottleneck transformation.
        key_query_shared_bottleneck (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether to use the same linear transformation for query and key in the bottleneck.
        num_feedforward_networks (:obj:`int`, `optional`, defaults to 4):
            Number of FFNs in a block.
        normalization_type (:obj:`str`, `optional`, defaults to :obj:`"no_norm"`):
            The normalization type in MobileBERT.

    Examples:

        >>> from transformers import MobileBertModel, MobileBertConfig
...
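The doctest above is cut off by the collapsed part of the diff; presumably it continues with the usual configuration/model round trip, sketched here as standard usage rather than text copied from this commit::

    from transformers import MobileBertConfig, MobileBertModel

    # Build a configuration with the documented defaults (24 layers, hidden size 512, ...)
    configuration = MobileBertConfig()

    # Instantiate a randomly initialized model from that configuration
    model = MobileBertModel(configuration)

    # The configuration can always be recovered from the model
    configuration = model.config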
...@@ -28,73 +28,79 @@ OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class OpenAIGPTConfig(PretrainedConfig):
    """
    This is the configuration class to store the configuration of a :class:`~transformers.OpenAIGPTModel` or a
    :class:`~transformers.TFOpenAIGPTModel`. It is used to instantiate a GPT model according to the specified
    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a
    similar configuration to that of the `GPT <https://huggingface.co/openai-gpt>`__ architecture from OpenAI.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the
    model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.

    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 40478):
            Vocabulary size of the GPT model. Defines the number of different tokens that can be represented
            by the :obj:`inputs_ids` passed when calling :class:`~transformers.OpenAIGPTModel` or
            :class:`~transformers.TFOpenAIGPTModel`.
        n_positions (:obj:`int`, `optional`, defaults to 512):
            The maximum sequence length that this model might ever be used with. Typically set this to
            something large just in case (e.g., 512 or 1024 or 2048).
        n_ctx (:obj:`int`, `optional`, defaults to 512):
            Dimensionality of the causal mask (usually same as n_positions).
        n_embd (:obj:`int`, `optional`, defaults to 768):
            Dimensionality of the embeddings and hidden states.
        n_layer (:obj:`int`, `optional`, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        n_head (:obj:`int`, `optional`, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        afn (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string,
            :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
        resid_pdrop (:obj:`float`, `optional`, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        embd_pdrop (:obj:`int`, `optional`, defaults to 0.1):
            The dropout ratio for the embeddings.
        attn_pdrop (:obj:`float`, `optional`, defaults to 0.1):
            The dropout ratio for the attention.
        layer_norm_epsilon (:obj:`float`, `optional`, defaults to 1e-5):
            The epsilon to use in the layer normalization layers.
        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        predict_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether or not special tokens should be predicted when the model has a language modeling head.
        summary_type (:obj:`str`, `optional`, defaults to :obj:`"cls_index"`):
            Argument used when doing sequence summary, used in the models
            :class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.

            Has to be one of the following options:

            - :obj:`"last"`: Take the last token hidden state (like XLNet).
            - :obj:`"first"`: Take the first token hidden state (like BERT).
            - :obj:`"mean"`: Take the mean of all tokens hidden states.
            - :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
            - :obj:`"attn"`: Not implemented now, use multi-head attention.
        summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Argument used when doing sequence summary, used in the models
            :class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.

            Whether or not to add a projection after the vector extraction.
        summary_activation (:obj:`str`, `optional`):
            Argument used when doing sequence summary, used in the models
            :class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.

            Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
        summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Argument used when doing sequence summary, used in the models
            :class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.

            Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size`
            classes.
        summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1):
            Argument used when doing sequence summary, used in the models
            :class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.

            The dropout ratio to be used after the projection and activation.

    Examples::

        >>> from transformers import OpenAIGPTConfig, OpenAIGPTModel
...
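The :obj:`summary_*` arguments above only come into play for the multiple-choice head; a brief sketch of overriding them when building the double-heads model (the particular values are illustrative, not pulled from this diff)::

    from transformers import OpenAIGPTConfig, OpenAIGPTDoubleHeadsModel

    config = OpenAIGPTConfig(
        summary_type="cls_index",     # summarize at a supplied classification token position
        summary_use_proj=True,        # add a projection after extracting the summary vector
        summary_activation="tanh",    # tanh activation on the projected summary
        summary_proj_to_labels=True,  # project to config.num_labels classes instead of hidden_size
        summary_first_dropout=0.1,    # dropout used in the sequence summary
    )
    model = OpenAIGPTDoubleHeadsModel(config)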
...@@ -29,96 +29,120 @@ REFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class ReformerConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a :class:`~transformers.ReformerModel`. It is
    used to instantiate a Reformer model according to the specified arguments, defining the model architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the
    model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.

    Args:
        attention_head_size (:obj:`int`, `optional`, defaults to 64):
            Dimensionality of the projected key, query and value vectors.
        attn_layers (:obj:`List[str]`, `optional`, defaults to :obj:`["local", "lsh", "local", "lsh", "local", "lsh"]`):
            List of attention layer types in ascending order. It can be chosen between an LSHSelfAttention layer
            (:obj:`"lsh"`) and a LocalSelfAttention layer (:obj:`"local"`).

            For more information on LSHSelfAttention layer, see `LSH Self Attention
            <reformer.html#lsh-self-attention>`__. For more information on LocalSelfAttention layer, see `Local
            Self Attention <reformer.html#local-sensitive-hashing-self-attention>`__.
        axial_pos_embds (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether or not to use axial position embeddings. For more information on how axial position
            embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.
        axial_norm_std (:obj:`float`, `optional`, defaults to 1.0):
            The standard deviation of the normal_initializer for initializing the weight matrices of the axial
            positional encodings.
        axial_pos_shape (:obj:`List[int]`, `optional`, defaults to :obj:`[64, 64]`):
            The position dims of the axial position encodings. During training, the product of the position dims
            has to be equal to the sequence length.

            For more information on how axial position embeddings work, see `Axial Position Encodings
            <reformer.html#axial-positional-encodings>`__.
        axial_pos_embds_dim (:obj:`List[int]`, `optional`, defaults to :obj:`[64, 192]`):
            The embedding dims of the axial position encodings. The sum of the embedding dims has to be equal to
            the hidden size.

            For more information on how axial position embeddings work, see `Axial Position Encodings
            <reformer.html#axial-positional-encodings>`__.
        chunk_size_lm_head (:obj:`int`, `optional`, defaults to 0):
            The chunk size of the final language model feed forward head layer. A chunk size of 0 means that the
            feed forward layer is not chunked. A chunk size of n means that the feed forward layer processes
            n < sequence_length embeddings at a time.

            For more information on feed forward chunking, see `How does Feed Forward Chunking work?
            <../glossary.html#feed-forward-chunking>`__.
        eos_token_id (:obj:`int`, `optional`, defaults to 2):
            The token id for the end-of-sentence token.
        feed_forward_size (:obj:`int`, `optional`, defaults to 512):
            Dimensionality of the feed_forward layer in the residual attention block.
        hash_seed (:obj:`int`, `optional`):
            Seed that can be used to make local sensitive hashing in :obj:`LSHSelfAttention` deterministic. This
            should only be set for testing purposes. For evaluation and training purposes, :obj:`hash_seed` should
            be left as :obj:`None` to ensure fully random rotations in the local sensitive hashing scheme.
        hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"relu"`):
            The non-linear activation function (function or string) in the feed forward layer in the residual
            attention block. If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are
            supported.
        hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.05):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        hidden_size (:obj:`int`, `optional`, defaults to 256):
            Dimensionality of the output hidden states of the residual attention blocks.
        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        is_decoder (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether or not to use a causal mask in addition to the :obj:`attention_mask` passed to
            :class:`~transformers.ReformerModel`. When using the Reformer for causal language modeling, this
            argument should be set to :obj:`True`.
        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        local_chunk_length (:obj:`int`, `optional`, defaults to 64):
            Length of chunk which attends to itself in :obj:`LocalSelfAttention`. Chunking reduces memory
            complexity from sequence length x sequence length (self attention) to
            chunk length x chunk length x sequence length / chunk length (chunked self attention).
        local_num_chunks_before (:obj:`int`, `optional`, defaults to 1):
            Number of previous neighbouring chunks to attend to in the :obj:`LocalSelfAttention` layer in addition
            to the chunk itself.
        local_num_chunks_after (:obj:`int`, `optional`, defaults to 0):
            Number of following neighbouring chunks to attend to in the :obj:`LocalSelfAttention` layer in
            addition to the chunk itself.
        local_attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
            The dropout ratio for the attention probabilities in :obj:`LocalSelfAttention`.
        lsh_attn_chunk_length (:obj:`int`, `optional`, defaults to 64):
            Length of chunk which attends to itself in :obj:`LSHSelfAttention`. Chunking reduces memory complexity
            from sequence length x sequence length (self attention) to
            chunk length x chunk length x sequence length / chunk length (chunked self attention).
        lsh_num_chunks_before (:obj:`int`, `optional`, defaults to 1):
            Number of previous neighbouring chunks to attend to in the :obj:`LSHSelfAttention` layer in addition
            to the chunk itself.
        lsh_num_chunks_after (:obj:`int`, `optional`, defaults to 0):
            Number of following neighbouring chunks to attend to in the :obj:`LSHSelfAttention` layer in addition
            to the chunk itself.
        lsh_attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
            The dropout ratio for the attention probabilities in :obj:`LSHSelfAttention`.
        max_position_embeddings (:obj:`int`, `optional`, defaults to 4096):
            The maximum sequence length that this model might ever be used with. Typically set this to something
            large just in case (e.g., 512 or 1024 or 2048).
        num_attention_heads (:obj:`int`, `optional`, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        num_buckets (:obj:`int` or :obj:`List[int]`, `optional`):
            Number of buckets the key query vectors can be "hashed into" using the locality sensitive hashing
            scheme. Each query key vector is hashed into a hash in :obj:`1, ..., num_buckets`. The number of
            buckets can also be factorized into a list for improved memory complexity. In this case, each query
            key vector is hashed into a hash in
            :obj:`1-1, 1-2, ..., num_buckets[0]-1, ..., num_buckets[0]-num_buckets[1]` if :obj:`num_buckets` is
            factorized into two factors. The number of buckets (or the product of the factors) should
            approximately equal sequence length / lsh_chunk_length. If :obj:`num_buckets` is not set, a good value
            is calculated on the fly.
        num_hashes (:obj:`int`, `optional`, defaults to 1):
            Number of hashing rounds (e.g., number of random rotations) in Local Sensitive Hashing scheme. The
            higher :obj:`num_hashes`, the more accurate the :obj:`LSHSelfAttention` becomes, but also the more
            memory and time intensive the hashing becomes.
        pad_token_id (:obj:`int`, `optional`, defaults to 0):
            The token id for the padding token.
        vocab_size (:obj:`int`, `optional`, defaults to 320):
            Vocabulary size of the Reformer model. Defines the number of different tokens that can be represented
            by the :obj:`inputs_ids` passed when calling :class:`~transformers.ReformerModel`.
        tie_word_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether to tie input and output embeddings.

    Examples::

        >>> from transformers import ReformerModel, ReformerConfig
...
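The two axial-position constraints spelled out above (the product of :obj:`axial_pos_shape` must equal the training sequence length, and the sum of :obj:`axial_pos_embds_dim` must equal the hidden size) are easy to check numerically; a sketch, where the training sequence length of 4096 is an assumption chosen to match the defaults::

    from transformers import ReformerConfig, ReformerModel

    sequence_length = 4096  # assumed training sequence length for this sketch

    config = ReformerConfig(
        hidden_size=256,
        attn_layers=["local", "lsh", "local", "lsh", "local", "lsh"],
        axial_pos_embds=True,
        axial_pos_shape=[64, 64],       # product of the position dims: 64 * 64 == 4096
        axial_pos_embds_dim=[64, 192],  # sum of the embedding dims: 64 + 192 == 256 == hidden_size
        is_decoder=True,                # causal mask for language modeling
    )

    # The constraints described in the docstring above:
    assert config.axial_pos_shape[0] * config.axial_pos_shape[1] == sequence_length
    assert sum(config.axial_pos_embds_dim) == config.hidden_size

    model = ReformerModel(config)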