Unverified commit 3323146e authored by Sylvain Gugger, committed by GitHub

Models doc (#7345)



* Clean up model documentation

* Formatting

* Preparation work

* Long lines

* Main work on rst files

* Cleanup all config files

* Syntax fix

* Clean all tokenizers

* Work on first models

* Models beginning

* FlauBERT

* All PyTorch models

* All models

* Long lines again

* Fixes

* More fixes

* Update docs/source/model_doc/bert.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update docs/source/model_doc/electra.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Last fixes
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
parent 58405a52
Tokenizer summary
-----------------
On this page, we will take a closer look at tokenization. As we saw in
:doc:`the preprocessing tutorial <preprocessing>`, tokenizing a text means splitting it into words or subwords, which
are then converted to ids. The second part is pretty straightforward; here we will focus on the first part. More
specifically, we will look at the three main kinds of tokenizers used in 🤗 Transformers:
:ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>` and
:ref:`SentencePiece <sentencepiece>`, and provide examples of models using each of them.

Note that on each model page, you can look at the documentation of the associated tokenizer to know which of these
algorithms the pretrained model used. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
that it uses :ref:`WordPiece <wordpiece>`.
Introduction to tokenization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing it.
For instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do." A first simple way of
tokenizing this text is just to split it on spaces, which would give:

::

    ["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]
This is a nice first step, but if we look at the tokens "Transformers?" or "do.", we can see we can do better. Those
will be different from the tokens "Transformers" and "do" for our model, so we should probably take the punctuation
into account. This would give:

::

    ["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]

which is already better. One thing that is annoying, though, is how it dealt with "Don't". "Don't" stands for "do
not", so it would be better tokenized as ``["Do", "n't"]``. This is where things start getting more complicated, and
part of the reason each kind of model has its own tokenizer class. Depending on the rules we apply to split our texts
into tokens, we'll get different tokenized versions of the same text. And of course, a given pretrained model won't
perform properly if you don't use the exact same rules as the people who pretrained it.
`spaCy <https://spacy.io/>`__ and `Moses <http://www.statmt.org/moses/?n=Development.GetStarted>`__ are two popular
rule-based tokenizers. On the text above, they'd output something like:

::

    ["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
Space/punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is splitting
a sentence into words. While it's the most intuitive way to separate texts into smaller chunks, it can become a
problem when you have a huge corpus: it usually yields a very big vocabulary (the set of all unique tokens used).
:doc:`Transformer XL <model_doc/transformerxl>`, for instance, uses space/punctuation tokenization and has a
vocabulary size of 267,735!

A huge vocabulary size means a huge embedding matrix at the start of the model, which will cause memory problems.
Transformer XL deals with this by using a special kind of embeddings called adaptive embeddings, but in general,
transformers models rarely have a vocabulary size greater than 50,000, especially if they are trained on a single
language.
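To give an order of magnitude: with a hidden size of 1,024, for instance, an embedding matrix for 267,735 tokens
alone holds 267,735 × 1,024 ≈ 274 million parameters, or roughly 1GB of memory in float32.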
So if tokenizing on words is unsatisfactory, we could go in the opposite direction and simply tokenize on characters.
While it's very simple and would save a lot of memory, this doesn't allow the model to learn representations of texts
that are as meaningful as with a word tokenization, leading to a loss of performance. So, to get the best of both
worlds, all transformers models use a hybrid between word-level and character-level tokenization called subword
tokenization.
Subword tokenization
^^^^^^^^^^^^^^^^^^^^
Subword tokenization algorithms rely on the principle that the most common words should be left as is, but rare words
should be decomposed into meaningful subword units. For instance, "annoyingly" might be considered a rare word and
decomposed as "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you
can form (almost) arbitrarily long complex words by stringing subwords together.
This allows the model to keep a reasonable vocabulary while still learning useful representations for common words or
subwords. This also enables the model to process words it has never seen before, by decomposing them into
subwords it knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like
this:
.. code-block::

    >>> from transformers import BertTokenizer
    >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    >>> tokenizer.tokenize("I have a new GPU!")
    ['i', 'have', 'a', 'new', 'gp', '##u', '!']
Since we are considering the uncased model, the sentence was lowercased first. Then all the words were present in the
vocabulary of the tokenizer except "gpu", so the tokenizer split it into subwords it knows: "gp" and "##u". The "##"
means that the rest of the token should be attached to the previous one, without a space (which matters when we need
to decode predictions and reverse the tokenization).
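Continuing the snippet above, we can glue the pieces back together with the tokenizer's ``convert_tokens_to_string``
method (the output shown below is what we expect from the rule just described):

.. code-block::

    >>> tokenizer.convert_tokens_to_string(['i', 'have', 'a', 'new', 'gp', '##u', '!'])
    'i have a new gpu !'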
Another example is when we use the base :class:`~transformers.XLNetTokenizer` to tokenize our previous text:
.. code-block::

    >>> from transformers import XLNetTokenizer
    >>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
    >>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
    ['▁Don', "'", 't', '▁you', '▁love', '▁', '🤗', '▁', 'Transform', 'ers', '?', '▁We', '▁sure', '▁do', '.']
We'll get back to the meaning of those '▁' characters when we look at :ref:`SentencePiece <sentencepiece>`, but you
can already see that "Transformers" has been split into "Transform" and "ers".
Let's now look at how the different subword tokenization algorithms work. Note that they all rely on some form of
training which is usually done on the corpus the corresponding model will be trained on.
.. _byte-pair-encoding:
Byte-Pair Encoding
~~~~~~~~~~~~~~~~~~
Byte-Pair Encoding was introduced in `this paper <https://arxiv.org/abs/1508.07909>`__. It relies on a pretokenizer
that splits the training data into words and counts the frequency of each word in the training corpus. The
pretokenization can be simple space tokenization (:doc:`GPT-2 <model_doc/gpt2>` and
:doc:`RoBERTa <model_doc/roberta>` use this, for instance) or rule-based tokenization (:doc:`XLM <model_doc/xlm>`
uses Moses for most languages, as does :doc:`FlauBERT <model_doc/flaubert>`, while :doc:`GPT <model_doc/gpt>` uses
spaCy and ftfy).

The tokenizer then starts from the list of all characters present in the corpus and learns merge rules that form a
new token from two symbols of the current vocabulary, repeating this until the vocabulary has reached the desired
size (which is a hyperparameter to pick).
Let's say that after the pretokenization we have the following words (the number indicating the frequency of each
word):

::

    ('hug', 10), ('pug', 5), ('pun', 12), ('bun', 4), ('hugs', 5)

Then the base vocabulary is ``['b', 'g', 'h', 'n', 'p', 's', 'u']`` and all our words are first split into
characters:

::

    ('h' 'u' 'g', 10), ('p' 'u' 'g', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'u' 'g' 's', 5)
We then look at how often each pair of consecutive symbols appears and pick the most frequent one. For instance, 'hu'
is present `10 + 5 = 15` times (10 times in the 10 occurrences of 'hug', 5 times in the 5 occurrences of 'hugs'). The
most frequent pair here is 'ug', present `10 + 5 + 5 = 20` times in total. So the first merge rule the tokenizer
learns is to group every 'u' followed by a 'g' together, and it adds 'ug' to the vocabulary. Our corpus then becomes

::

    ('h' 'ug', 10), ('p' 'ug', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'ug' 's', 5)

and we continue by looking at the next most frequent pair of symbols. It's 'un', present 16 times, so we merge those
two and add 'un' to the vocabulary. Then it's 'hug' (as 'h' + 'ug'), present 15 times, so we merge those two and add
'hug' to the vocabulary.
At this stage, the vocabulary is ``['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']`` and our corpus is
represented as
::

    ('hug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('hug' 's', 5)
If we stop there, the tokenizer can apply the rules it has learned to new words, as long as they don't contain
characters that were not in the base vocabulary. For instance, 'bug' would be tokenized as ``['b', 'ug']`` but 'mug'
would be tokenized as ``['<unk>', 'ug']`` since 'm' is not in the base vocabulary. In practice this rarely happens to
letters (since the base corpus usually contains all of them), but it does happen to special characters like emojis.
As we said before, the vocabulary size (which is the base vocabulary size + the number of merges) is a hyperparameter
to choose. For instance, :doc:`GPT <model_doc/gpt>` has a vocabulary size of 40,478: it has 478 base characters and
the training of the tokenizer was stopped after 40,000 merges.
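To make the procedure concrete, here is a minimal (and deliberately naive) sketch of this training loop in plain
Python, run on the toy corpus above; it only mirrors the walkthrough and is not the optimized implementation used by
real tokenizers:

.. code-block::

    import collections

    def get_pair_counts(corpus):
        """Count how often each adjacent pair of symbols appears, weighted by word frequency."""
        pairs = collections.Counter()
        for symbols, freq in corpus:
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        return pairs

    def merge_pair(corpus, pair):
        """Apply a merge rule: replace every occurrence of ``pair`` by the fused symbol."""
        merged_corpus = []
        for symbols, freq in corpus:
            new_symbols, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                    new_symbols.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    new_symbols.append(symbols[i])
                    i += 1
            merged_corpus.append((new_symbols, freq))
        return merged_corpus

    # The toy corpus, with each word split into characters.
    corpus = [(list(w), f) for w, f in [("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)]]
    vocab = sorted({symbol for symbols, _ in corpus for symbol in symbols})

    for _ in range(3):  # three merges, matching the walkthrough above
        pair_counts = get_pair_counts(corpus)
        best_pair = max(pair_counts, key=pair_counts.get)
        vocab.append(best_pair[0] + best_pair[1])
        corpus = merge_pair(corpus, best_pair)

    print(vocab)  # ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']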
Byte-level BPE
^^^^^^^^^^^^^^
To deal with the fact that the base vocabulary needs to contain all base characters, which can get quite big if one
allows for all unicode characters, the
`GPT-2 paper <https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__
introduces a clever trick: use bytes as the base vocabulary (which forces a base size of 256). With some additional
rules to deal with punctuation, this makes it possible to tokenize every text without needing an unknown token. For
instance, the :doc:`GPT-2 model <model_doc/gpt2>` has a vocabulary size of 50,257, which corresponds to the 256 base
byte tokens, a special end-of-text token and the symbols learned with 50,000 merges.
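As a quick sanity check (the exact subtokens depend on the pretrained vocabulary, so treat the result as indicative),
the GPT-2 tokenizer splits an emoji into several byte-level tokens instead of mapping it to an unknown token:

.. code-block::

    >>> from transformers import GPT2Tokenizer
    >>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    >>> tokenizer.tokenize("I love 🤗")  # no <unk>: the emoji becomes a few byte-level tokens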
.. _wordpiece:
WordPiece
~~~~~~~~~
WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>` (as well as
:doc:`DistilBERT <model_doc/distilbert>` and :doc:`Electra <model_doc/electra>`) and was outlined in
`this paper <https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__. It relies
on the same base as BPE: initialize the vocabulary with every character present in the corpus and progressively learn
a given number of merge rules. The difference is that it doesn't choose the most frequent pair, but the one that will
maximize the likelihood of the corpus once merged.

What does this mean? Well, in the previous example, it means we only merge 'u' and 'g' if the probability of having
'ug' divided by the product of the probabilities of having 'u' and 'g' is greater than for any other pair of symbols.
It's subtly different from what BPE does in the sense that it evaluates what it "loses" by merging two symbols and
makes sure it's `worth it`.
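To make this concrete, here is a small sketch of that scoring criterion on the toy corpus from the BPE section (a
simplification of the actual WordPiece training procedure, shown only to illustrate the idea):

.. code-block::

    import collections

    # Symbol and pair frequencies for the toy corpus used in the BPE section.
    words = [("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)]

    symbol_counts, pair_counts = collections.Counter(), collections.Counter()
    for word, freq in words:
        for symbol in word:
            symbol_counts[symbol] += freq
        for pair in zip(word, word[1:]):
            pair_counts[pair] += freq

    # WordPiece-style score: frequency of the pair divided by the product of the
    # frequencies of its parts (proportional to the likelihood gained by merging it).
    scores = {pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
              for pair, count in pair_counts.items()}

    best = max(scores, key=scores.get)
    print(best, scores[best])
    # ('g', 's') 0.05 -- not ('u', 'g'): 'ug' is the most frequent pair, but 'u' and
    # 'g' are individually so common that merging them gains comparatively little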
.. _unigram:
Unigram
~~~~~~~
Unigram is a subword tokenization algorithm introduced in `this paper <https://arxiv.org/pdf/1804.10959.pdf>`__.
Instead of starting with a group of base symbols and learning merges with some rule, like BPE or WordPiece, it starts
from a large vocabulary (for instance, all pretokenized words and the most common substrings) that it will trim down
progressively. It's not used directly for any of the pretrained models in the library, but it's used in conjunction
with :ref:`SentencePiece <sentencepiece>`.
More specifically, at a given step, Unigram computes a loss from the corpus we have and the current vocabulary and
then, for each subword, evaluates how much the loss would increase if that subword were removed from the vocabulary.
It then sorts the subwords by this quantity and removes the p percent of tokens whose removal increases the loss the
least, i.e. the ones that contribute the least to the overall loss (p could be 10 or 20, for instance). It then
repeats the process until the vocabulary has reached the desired size, always keeping the base characters (to be able
to tokenize any word written with them, like BPE or WordPiece).
Contrary to BPE and WordPiece, which produce rules in a certain order that you can then apply in the same order when
tokenizing new text, Unigram allows several ways of tokenizing a new text. For instance, if it ends up with the
vocabulary

::

    ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']

we had before, it could tokenize "hugs" as ``['hug', 's']``, ``['h', 'ug', 's']`` or ``['h', 'u', 'g', 's']``. So
which one should we choose? On top of saving the vocabulary, the trained tokenizer saves the probability of each
token in the training corpus. You can then give a probability to each tokenization (the product of the probabilities
of the tokens forming it) and pick the most likely one (or, if you want to apply some data augmentation, sample one
of the tokenizations according to those probabilities).
Those probabilities define the loss that trains the tokenizer: if our corpus consists of the words
:math:`x_{1}, \dots, x_{N}` and if, for the word :math:`x_{i}`, we denote by :math:`S(x_{i})` the set of all possible
tokenizations of :math:`x_{i}` (with the current vocabulary), then the loss is defined as

.. math::

    \mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )
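As a toy illustration (with completely made-up token probabilities, only to show the mechanics), here is a
brute-force sketch that enumerates the tokenizations of a word, picks the most likely one, and computes the loss
above on the small corpus from the BPE section:

.. code-block::

    import math

    # Hypothetical unigram probabilities for the vocabulary of the example above.
    token_probs = {'b': 0.05, 'g': 0.05, 'h': 0.05, 'n': 0.05, 'p': 0.05,
                   's': 0.10, 'u': 0.05, 'ug': 0.20, 'un': 0.20, 'hug': 0.25}

    def tokenizations(word):
        """Enumerate every way of splitting ``word`` into tokens of the vocabulary."""
        if not word:
            return [[]]
        return [[word[:i]] + rest
                for i in range(1, len(word) + 1) if word[:i] in token_probs
                for rest in tokenizations(word[i:])]

    def proba(tokens):
        """Probability of one tokenization: the product of its token probabilities."""
        return math.prod(token_probs[token] for token in tokens)

    print(tokenizations("hugs"))
    # [['h', 'u', 'g', 's'], ['h', 'ug', 's'], ['hug', 's']]
    print(max(tokenizations("hugs"), key=proba))
    # ['hug', 's'] -- the most likely tokenization with these (made-up) probabilities

    # The loss on the toy corpus, grouping identical words by their frequency.
    corpus = [("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)]
    loss = -sum(freq * math.log(sum(proba(tok) for tok in tokenizations(word)))
                for word, freq in corpus)
    print(loss)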
.. _sentencepiece:
SentencePiece
~~~~~~~~~~~~~
All the tokenization methods we have looked at so far require some form of pretokenization, which has a central
problem: not all languages use spaces to separate words. :doc:`XLM <model_doc/xlm>` solves this by using specific
pretokenizers for those languages (in its case, Chinese, Japanese and Thai). To solve the problem more generally,
SentencePiece (introduced in `this paper <https://arxiv.org/pdf/1808.06226.pdf>`__) treats the input as a raw stream,
includes the space in the set of characters to use, then uses BPE or unigram to construct the appropriate vocabulary.

That's why, in the example we saw before using :class:`~transformers.XLNetTokenizer` (which uses SentencePiece), we
had the '▁' character, which represents a space. Decoding a tokenized text is then very easy: we just have to
concatenate all the tokens together and replace '▁' with a space.
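Following that rule by hand on the tokens of the XLNet example above (the library's tokenizers do the equivalent for
you through their decoding methods):

.. code-block::

    >>> tokens = ['▁Don', "'", 't', '▁you', '▁love', '▁', '🤗', '▁', 'Transform', 'ers', '?', '▁We', '▁sure', '▁do', '.']
    >>> "".join(tokens).replace('▁', ' ').lstrip()
    "Don't you love 🤗 Transformers? We sure do."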
All transformers models in the library that use SentencePiece use it in combination with unigram. Examples of models
using SentencePiece are :doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>` and the
:doc:`Marian framework <model_doc/marian>`.
Training and fine-tuning
========================
Model classes in 🤗 Transformers are designed to be compatible with native
PyTorch and TensorFlow 2 and can be used seamlessly with either. In this
@@ -24,7 +24,7 @@ Sections:
.. _pytorch:
Fine-tuning in native PyTorch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Model classes in 🤗 Transformers that don't begin with ``TF`` are
`PyTorch Modules <https://pytorch.org/docs/master/generated/torch.nn.Module.html>`_,
@@ -141,7 +141,7 @@ with features like mixed precision and easy tensorboard logging.
Freezing the encoder
--------------------
In some cases, you might be interested in keeping the weights of the
pre-trained encoder frozen and optimizing only the weights of the head
@@ -158,7 +158,7 @@ submodule on any task-specific model in the library:
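A minimal sketch of what that looks like in PyTorch, assuming (as for the task-specific models in the library) that
the encoder is reachable through the ``base_model`` attribute:

.. code-block::

    >>> from transformers import BertForSequenceClassification
    >>> model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
    >>> for param in model.base_model.parameters():  # the bare encoder, without the classification head
    ...     param.requires_grad = False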
.. _tensorflow:
Fine-tuning in native TensorFlow 2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Models can also be trained natively in TensorFlow 2. Just as with PyTorch,
TensorFlow models can be instantiated with
@@ -210,7 +210,7 @@ can even save the model and then reload it as a PyTorch model (or vice-versa):
.. _trainer:
Trainer
^^^^^^^
We also provide a simple but feature-complete training and evaluation
interface through :func:`~transformers.Trainer` and
@@ -303,7 +303,7 @@ launching tensorboard in your specified ``logging_dir`` directory.
.. _additional-resources:
Additional resources
^^^^^^^^^^^^^^^^^^^^
- `A lightweight colab demo <https://colab.research.google.com/drive/1-JIJlao4dI-Ilww_NnTc0rxtp-ymgDgM?usp=sharing>`_
which uses ``Trainer`` for IMDb sentiment classification.
@@ -32,54 +32,55 @@ ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class AlbertConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.AlbertModel`.
It is used to instantiate an ALBERT model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.
This is the configuration class to store the configuration of a :class:`~transformers.AlbertModel` or a
:class:`~transformers.TFAlbertModel`. It is used to instantiate an ALBERT model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, optional, defaults to 30000):
Vocabulary size of the ALBERT model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.AlbertModel`.
embedding_size (:obj:`int`, optional, defaults to 128):
vocab_size (:obj:`int`, `optional`, defaults to 30000):
Vocabulary size of the ALBERT model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.AlbertModel` or
:class:`~transformers.TFAlbertModel`.
embedding_size (:obj:`int`, `optional`, defaults to 128):
Dimensionality of vocabulary embeddings.
hidden_size (:obj:`int`, optional, defaults to 4096):
hidden_size (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, optional, defaults to 12):
num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_hidden_groups (:obj:`int`, optional, defaults to 1):
num_hidden_groups (:obj:`int`, `optional`, defaults to 1):
Number of groups for the hidden layers, parameters in the same group are shared.
num_attention_heads (:obj:`int`, optional, defaults to 64):
num_attention_heads (:obj:`int`, `optional`, defaults to 64):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, optional, defaults to 16384):
The dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
inner_group_num (:obj:`int`, optional, defaults to 1):
intermediate_size (:obj:`int`, `optional`, defaults to 16384):
The dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
inner_group_num (:obj:`int`, `optional`, defaults to 1):
The number of inner repetitions of attention and FFN.
hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu_new"):
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu_new"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported.
hidden_dropout_prob (:obj:`float`, optional, defaults to 0):
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0):
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, optional, defaults to 512):
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something
large (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, optional, defaults to 2):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.AlbertModel`.
initializer_range (:obj:`float`, optional, defaults to 0.02):
type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.AlbertModel` or
:class:`~transformers.TFAlbertModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
classifier_dropout_prob (:obj:`float`, optional, defaults to 0.1):
classifier_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for attached classifiers.
Example::
Examples::
>>> from transformers import AlbertConfig, AlbertModel
>>> # Initializing an ALBERT-xxlarge style configuration
@@ -50,10 +50,10 @@ BERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class BertConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.BertModel`.
It is used to instantiate an BERT model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.
This is the configuration class to store the configuration of a :class:`~transformers.BertModel` or a
:class:`~transformers.TFBertModel`. It is used to instantiate a BERT model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
@@ -61,37 +61,39 @@ class BertConfig(PretrainedConfig):
Args:
vocab_size (:obj:`int`, optional, defaults to 30522):
Vocabulary size of the BERT model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertModel`.
hidden_size (:obj:`int`, optional, defaults to 768):
vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.BertModel` or
:class:`~transformers.TFBertModel`.
hidden_size (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, optional, defaults to 12):
num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, optional, defaults to 12):
num_attention_heads (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, optional, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported.
hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, optional, defaults to 512):
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, optional, defaults to 2):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.BertModel`.
initializer_range (:obj:`float`, optional, defaults to 0.02):
type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.BertModel` or
:class:`~transformers.TFBertModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
gradient_checkpointing (:obj:`bool`, optional, defaults to :obj:`False`):
gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
If True, use gradient checkpointing to save memory at the expense of slower backward pass.
Example::
Examples::
>>> from transformers import BertModel, BertConfig
@@ -19,18 +19,18 @@ from .configuration_utils import PretrainedConfig
class BertGenerationConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.BertGenerationPreTrainedModel`.
It is used to instantiate a BertGenerationConfig model according to the specified arguments, defining the model architecture.
This is the configuration class to store the configuration of a
:class:`~transformers.BertGenerationPreTrainedModel`. It is used to instantiate a BertGeneration model according to
the specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 50358):
Vocabulary size of the BertGeneration model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertGeneration`.
Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.BertGeneration`.
hidden_size (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, `optional`, defaults to 24):
@@ -38,7 +38,7 @@ class BertGenerationConfig(PretrainedConfig):
num_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
Dimensionality of the "intermediate" (often called feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
@@ -56,7 +56,7 @@ class BertGenerationConfig(PretrainedConfig):
gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
If :obj:`True`, use gradient checkpointing to save memory at the expense of slower backward pass.
Example::
Examples::
>>> from transformers import BertGenerationConfig, BertGenerationEncoder
@@ -25,44 +25,45 @@ CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP = {"ctrl": "https://s3.amazonaws.com/models.h
class CTRLConfig(PretrainedConfig):
"""
This is the configuration class to store the configuration of a :class:`~transformers.CTRLModel`.
It is used to instantiate an CTRL model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the `ctrl <https://huggingface.co/ctrl>`__ architecture from SalesForce.
This is the configuration class to store the configuration of a :class:`~transformers.CTRLModel` or a
:class:`~transformers.TFCTRLModel`. It is used to instantiate a CTRL model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the `ctrl <https://huggingface.co/ctrl>`__ architecture from SalesForce.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, optional, defaults to 246534):
Vocabulary size of the CTRL model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.CTRLModel`.
n_positions (:obj:`int`, optional, defaults to 256):
vocab_size (:obj:`int`, `optional`, defaults to 246534):
Vocabulary size of the CTRL model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.CTRLModel` or
:class:`~transformers.TFCTRLModel`.
n_positions (:obj:`int`, `optional`, defaults to 256):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
n_ctx (:obj:`int`, optional, defaults to 256):
n_ctx (:obj:`int`, `optional`, defaults to 256):
Dimensionality of the causal mask (usually same as n_positions).
n_embd (:obj:`int`, optional, defaults to 1280):
n_embd (:obj:`int`, `optional`, defaults to 1280):
Dimensionality of the embeddings and hidden states.
dff (:obj:`int`, optional, defaults to 8192):
Dimensionality of the inner dimension of the FFN.
n_layer (:obj:`int`, optional, defaults to 48):
dff (:obj:`int`, `optional`, defaults to 8192):
The inner dimension of the feed forward networks (FFN).
n_layer (:obj:`int`, `optional`, defaults to 48):
Number of hidden layers in the Transformer encoder.
n_head (:obj:`int`, optional, defaults to 16):
n_head (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
resid_pdrop (:obj:`float`, optional, defaults to 0.1):
resid_pdrop (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
embd_pdrop (:obj:`int`, optional, defaults to 0.1):
embd_pdrop (:obj:`int`, `optional`, defaults to 0.1):
The dropout ratio for the embeddings.
attn_pdrop (:obj:`float`, optional, defaults to 0.1):
attn_pdrop (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention.
layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-6):
layer_norm_epsilon (:obj:`float`, `optional`, defaults to 1e-6):
The epsilon to use in the layer normalization layers
initializer_range (:obj:`float`, optional, defaults to 0.02):
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
Example::
Examples::
>>> from transformers import CTRLModel, CTRLConfig
@@ -33,50 +33,51 @@ DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class DistilBertConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.DistilBertModel`.
It is used to instantiate a DistilBERT model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the DistilBERT `distilbert-base-uncased <https://huggingface.co/distilbert-base-uncased>`__ architecture.
This is the configuration class to store the configuration of a :class:`~transformers.DistilBertModel` or a
:class:`~transformers.TFDistilBertModel`. It is used to instantiate a DistilBERT model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the DistilBERT
`distilbert-base-uncased <https://huggingface.co/distilbert-base-uncased>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, optional, defaults to 30522):
Vocabulary size of the DistilBERT model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertModel`.
max_position_embeddings (:obj:`int`, optional, defaults to 512):
vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the DistilBERT model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.DistilBertModel` or
:class:`~transformers.TFDistilBertModel`.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
sinusoidal_pos_embds (:obj:`boolean`, optional, defaults to :obj:`False`):
sinusoidal_pos_embds (:obj:`boolean`, `optional`, defaults to :obj:`False`):
Whether to use sinusoidal positional embeddings.
n_layers (:obj:`int`, optional, defaults to 6):
n_layers (:obj:`int`, `optional`, defaults to 6):
Number of hidden layers in the Transformer encoder.
n_heads (:obj:`int`, optional, defaults to 12):
n_heads (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
dim (:obj:`int`, optional, defaults to 768):
dim (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
hidden_dim (:obj:`int`, optional, defaults to 3072):
The size of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
dropout (:obj:`float`, optional, defaults to 0.1):
hidden_dim (:obj:`int`, `optional`, defaults to 3072):
The size of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, optional, defaults to 0.1):
attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
activation (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
activation (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported.
initializer_range (:obj:`float`, optional, defaults to 0.02):
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
qa_dropout (:obj:`float`, optional, defaults to 0.1):
qa_dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probabilities used in the question answering model
:class:`~transformers.DistilBertForQuestionAnswering`.
seq_classif_dropout (:obj:`float`, optional, defaults to 0.2):
seq_classif_dropout (:obj:`float`, `optional`, defaults to 0.2):
The dropout probabilities used in the sequence classification and the multiple choice model
:class:`~transformers.DistilBertForSequenceClassification`.
Example::
Examples::
>>> from transformers import DistilBertModel, DistilBertConfig
@@ -32,8 +32,12 @@ class DPRConfig(PretrainedConfig):
:class:`~transformers.DPRConfig` is the configuration class to store the configuration of a
`DPRModel`.
This is the configuration class to store the configuration of a `DPRContextEncoder`, `DPRQuestionEncoder`, or a `DPRReader`.
It is used to instantiate the components of the DPR model.
This is the configuration class to store the configuration of a :class:`~transformers.DPRContextEncoder`,
:class:`~transformers.DPRQuestionEncoder`, or a :class:`~transformers.DPRReader`. It is used to instantiate the
components of the DPR model.
This class is a subclass of :class:`~transformers.BertConfig`. Please check the
superclass for the documentation of all kwargs.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 30522):
@@ -33,11 +33,11 @@ ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class ElectraConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.ElectraModel`.
It is used to instantiate an ELECTRA model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the ELECTRA `google/electra-small-discriminator <https://huggingface.co/google/electra-small-discriminator>`__
architecture.
This is the configuration class to store the configuration of a :class:`~transformers.ElectraModel` or a
:class:`~transformers.TFElectraModel`. It is used to instantiate an ELECTRA model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the ELECTRA
`google/electra-small-discriminator <https://huggingface.co/google/electra-small-discriminator>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
@@ -45,59 +45,61 @@ class ElectraConfig(PretrainedConfig):
Args:
vocab_size (:obj:`int`, optional, defaults to 30522):
Vocabulary size of the ELECTRA model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.ElectraModel`.
embedding_size (:obj:`int`, optional, defaults to 128):
vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the ELECTRA model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.ElectraModel` or
:class:`~transformers.TFElectraModel`.
embedding_size (:obj:`int`, `optional`, defaults to 128):
Dimensionality of the token embeddings.
hidden_size (:obj:`int`, optional, defaults to 256):
hidden_size (:obj:`int`, `optional`, defaults to 256):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, optional, defaults to 12):
num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, optional, defaults to 4):
num_attention_heads (:obj:`int`, `optional`, defaults to 4):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, optional, defaults to 1024):
intermediate_size (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported.
hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, optional, defaults to 512):
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, optional, defaults to 2):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.ElectraModel`.
initializer_range (:obj:`float`, optional, defaults to 0.02):
type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.ElectraModel` or
:class:`~transformers.TFElectraModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
summary_type (:obj:`string`, optional, defaults to "first"):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.ElectraForMultipleChoice`.
Is one of the following options:
- 'last' => take the last token hidden state (like XLNet)
- 'first' => take the first token hidden state (like Bert)
- 'mean' => take the mean of all tokens hidden states
- 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)
- 'attn' => Not implemented now, use multi-head attention
summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.ElectraForMultipleChoice`.
Add a projection after the vector extraction
summary_activation (:obj:`string` or :obj:`None`, optional):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.ElectraForMultipleChoice`.
'gelu' => add a gelu activation to the output, Other => no activation.
summary_last_dropout (:obj:`float`, optional, defaults to 0.0):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.ElectraForMultipleChoice`.
Add a dropout after the projection and activation
Example::
summary_type (:obj:`str`, `optional`, defaults to :obj:`"first"`):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
Has to be one of the following options:
- :obj:`"last"`: Take the last token hidden state (like XLNet).
- :obj:`"first"`: Take the first token hidden state (like BERT).
- :obj:`"mean"`: Take the mean of all tokens hidden states.
- :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- :obj:`"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
Whether or not to add a projection after the vector extraction.
summary_activation (:obj:`str`, `optional`):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
Pass :obj:`"gelu"` for a gelu activation to the output, any other value will result in no activation.
summary_last_dropout (:obj:`float`, `optional`, defaults to 0.0):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
The dropout ratio to be used after the projection and activation.
Examples::
>>> from transformers import ElectraModel, ElectraConfig
......
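To complement the truncated example above, a minimal usage sketch for the ELECTRA configuration (assuming the library as of this commit; the model is randomly initialized)::

    >>> from transformers import ElectraConfig, ElectraModel
    >>> # Defaults roughly correspond to the google/electra-small-discriminator architecture
    >>> configuration = ElectraConfig()
    >>> # Instantiate a randomly initialized model from that configuration
    >>> model = ElectraModel(configuration)
    >>> # The configuration can be read back from the model
    >>> configuration = model.config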
......@@ -25,22 +25,24 @@ logger = logging.get_logger(__name__)
class EncoderDecoderConfig(PretrainedConfig):
r"""
:class:`~transformers.EncoderDecoderConfig` is the configuration class to store the configuration of a `EncoderDecoderModel`.
:class:`~transformers.EncoderDecoderConfig` is the configuration class to store the configuration of a
:class:`~transformers.EncoderDecoderModel`. It is used to instantiate an Encoder Decoder model according to the
specified arguments, defining the encoder and decoder configs.
It is used to instantiate an Encoder Decoder model according to the specified arguments, defining the encoder and decoder configs.
Configuration objects inherit from :class:`~transformers.PretrainedConfig`
and can be used to control the model outputs.
See the documentation for :class:`~transformers.PretrainedConfig` for more information.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
kwargs (`optional`):
Remaining dictionary of keyword arguments. Notably:
encoder (:class:`PretrainedConfig`, optional, defaults to `None`):
An instance of a configuration object that defines the encoder config.
decoder (:class:`PretrainedConfig`, optional, defaults to `None`):
An instance of a configuration object that defines the decoder config.
Dictionary of keyword arguments. Notably:
Example::
- **encoder** (:class:`~transformers.PretrainedConfig`, `optional`) -- An instance of a configuration
object that defines the encoder config.
- **decoder** (:class:`~transformers.PretrainedConfig`, `optional`) -- An instance of a configuration
object that defines the decoder config.
Examples::
>>> from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel
......
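A minimal sketch of how the ``encoder``/``decoder`` keyword arguments documented above are typically assembled (assuming the :meth:`~transformers.EncoderDecoderConfig.from_encoder_decoder_configs` helper, which is not shown in this diff; everything uses default values and random weights)::

    >>> from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel
    >>> config_encoder = BertConfig()
    >>> config_decoder = BertConfig()
    >>> # Combine the two configurations into a single EncoderDecoderConfig
    >>> config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
    >>> # Instantiate a randomly initialized encoder-decoder model from it
    >>> model = EncoderDecoderModel(config=config)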
......@@ -30,11 +30,9 @@ FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class FlaubertConfig(XLMConfig):
"""
Configuration class to store the configuration of a `FlaubertModel`.
This is the configuration class to store the configuration of a :class:`~transformers.XLMModel`.
It is used to instantiate an XLM model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the `xlm-mlm-en-2048 <https://huggingface.co/xlm-mlm-en-2048>`__ architecture.
This is the configuration class to store the configuration of a :class:`~transformers.FlaubertModel` or a
:class:`~transformers.TFFlaubertModel`. It is used to instantiate a FlauBERT model according to the specified
arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
......@@ -47,95 +45,95 @@ class FlaubertConfig(XLMConfig):
layerdrop (:obj:`float`, `optional`, defaults to 0.0):
Probability to drop layers during training (Fan et al., Reducing Transformer Depth on Demand
with Structured Dropout. ICLR 2020)
vocab_size (:obj:`int`, optional, defaults to 30145):
Vocabulary size of the Flaubert model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.FlaubertModel`.
emb_dim (:obj:`int`, optional, defaults to 2048):
vocab_size (:obj:`int`, `optional`, defaults to 30145):
Vocabulary size of the FlauBERT model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.FlaubertModel` or
:class:`~transformers.TFFlaubertModel`.
emb_dim (:obj:`int`, `optional`, defaults to 2048):
Dimensionality of the encoder layers and the pooler layer.
n_layer (:obj:`int`, optional, defaults to 12):
n_layer (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder.
n_head (:obj:`int`, optional, defaults to 16):
n_head (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
dropout (:obj:`float`, optional, defaults to 0.1):
dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected
layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, optional, defaults to 0.1):
attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for the attention mechanism
gelu_activation (:obj:`boolean`, optional, defaults to :obj:`True`):
The non-linear activation function (function or string) in the
encoder and pooler. If set to `True`, "gelu" will be used instead of "relu".
sinusoidal_embeddings (:obj:`boolean`, optional, defaults to :obj:`False`):
Whether to use sinusoidal positional embeddings instead of absolute positional embeddings.
causal (:obj:`boolean`, optional, defaults to :obj:`False`):
Set this to `True` for the model to behave in a causal manner.
gelu_activation (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to use a `gelu` activation instead of `relu`.
sinusoidal_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to use sinusoidal positional embeddings instead of absolute positional embeddings.
causal (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the model should behave in a causal manner.
Causal models use a triangular attention mask in order to only attend to the left-side context instead
of a bidirectional context.
asm (:obj:`boolean`, optional, defaults to :obj:`False`):
Whether to use an adaptive log softmax projection layer instead of a linear layer for the prediction
asm (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to use an adaptive log softmax projection layer instead of a linear layer for the prediction
layer.
n_langs (:obj:`int`, optional, defaults to 1):
n_langs (:obj:`int`, `optional`, defaults to 1):
The number of languages the model handles. Set to 1 for monolingual models.
use_lang_emb (:obj:`boolean`, optional, defaults to :obj:`True`)
use_lang_emb (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to use language embeddings. Some models use additional language embeddings, see
`the multilingual models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__
for information on how to use them.
max_position_embeddings (:obj:`int`, optional, defaults to 512):
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might
ever be used with. Typically set this to something large just in case
(e.g., 512 or 1024 or 2048).
embed_init_std (:obj:`float`, optional, defaults to 2048^-0.5):
embed_init_std (:obj:`float`, `optional`, defaults to 2048^-0.5):
The standard deviation of the truncated_normal_initializer for
initializing the embedding matrices.
init_std (:obj:`int`, optional, defaults to 50257):
init_std (:obj:`int`, `optional`, defaults to 50257):
The standard deviation of the truncated_normal_initializer for
initializing all weight matrices except the embedding matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
bos_index (:obj:`int`, optional, defaults to 0):
bos_index (:obj:`int`, `optional`, defaults to 0):
The index of the beginning of sentence token in the vocabulary.
eos_index (:obj:`int`, optional, defaults to 1):
eos_index (:obj:`int`, `optional`, defaults to 1):
The index of the end of sentence token in the vocabulary.
pad_index (:obj:`int`, optional, defaults to 2):
pad_index (:obj:`int`, `optional`, defaults to 2):
The index of the padding token in the vocabulary.
unk_index (:obj:`int`, optional, defaults to 3):
unk_index (:obj:`int`, `optional`, defaults to 3):
The index of the unknown token in the vocabulary.
mask_index (:obj:`int`, optional, defaults to 5):
mask_index (:obj:`int`, `optional`, defaults to 5):
The index of the masking token in the vocabulary.
is_encoder(:obj:`boolean`, optional, defaults to :obj:`True`):
Whether the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.
summary_type (:obj:`string`, optional, defaults to "first"):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.XLMForSequenceClassification`.
Is one of the following options:
- 'last' => take the last token hidden state (like XLNet)
- 'first' => take the first token hidden state (like Bert)
- 'mean' => take the mean of all tokens hidden states
- 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)
- 'attn' => Not implemented now, use multi-head attention
summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.XLMForSequenceClassification`.
Add a projection after the vector extraction
summary_activation (:obj:`string` or :obj:`None`, optional):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.XLMForSequenceClassification`.
'tanh' => add a tanh activation to the output, Other => no activation.
summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.XLMForSequenceClassification`.
If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.
summary_first_dropout (:obj:`float`, optional, defaults to 0.1):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.XLMForSequenceClassification`.
Add a dropout before the projection and activation
start_n_top (:obj:`int`, optional, defaults to 5):
Used in the SQuAD evaluation script for XLM and XLNet.
end_n_top (:obj:`int`, optional, defaults to 5):
Used in the SQuAD evaluation script for XLM and XLNet.
mask_token_id (:obj:`int`, optional, defaults to 0):
is_encoder(:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.
summary_type (:obj:`str`, `optional`, defaults to :obj:`"first"`):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
Has to be one of the following options:
- :obj:`"last"`: Take the last token hidden state (like XLNet).
- :obj:`"first"`: Take the first token hidden state (like BERT).
- :obj:`"mean"`: Take the mean of all tokens hidden states.
- :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- :obj:`"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
Whether or not to add a projection after the vector extraction.
summary_activation (:obj:`str`, `optional`):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
Used in the sequence classification and multiple choice models.
Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes.
summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1):
Used in the sequence classification and multiple choice models.
The dropout ratio to be used after the projection and activation.
start_n_top (:obj:`int`, `optional`, defaults to 5):
Used in the SQuAD evaluation script.
end_n_top (:obj:`int`, `optional`, defaults to 5):
Used in the SQuAD evaluation script.
mask_token_id (:obj:`int`, `optional`, defaults to 0):
Model agnostic parameter to identify masked tokens when generating text in an MLM context.
lang_id (:obj:`int`, optional, defaults to 1):
lang_id (:obj:`int`, `optional`, defaults to 1):
The ID of the language used by the model. This parameter is used when generating
text in a given language.
"""
......
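A minimal sketch showing how the FlauBERT configuration arguments above are used (assuming the library as of this commit; the resulting model has random weights)::

    >>> from transformers import FlaubertConfig, FlaubertModel
    >>> # vocab_size is one of the documented arguments; 30145 is simply the documented default
    >>> configuration = FlaubertConfig(vocab_size=30145)
    >>> model = FlaubertModel(configuration)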
......@@ -18,7 +18,6 @@
import copy
from .configuration_utils import PretrainedConfig
from .file_utils import add_start_docstrings_to_callable
from .utils import logging
......@@ -27,33 +26,54 @@ logger = logging.get_logger(__name__)
FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
FSMT_CONFIG_ARGS_DOC = r"""
class DecoderConfig(PretrainedConfig):
r"""
Configuration class for FSMT's decoder specific things.
note: this is a private helper class
"""
model_type = "fsmt_decoder"
def __init__(self, vocab_size=0, bos_token_id=0):
super().__init__()
self.vocab_size = vocab_size
self.bos_token_id = bos_token_id
class FSMTConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.FSMTModel`. It is used to
instantiate a FSMT model according to the specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
langs (:obj:`List[str]`):
source language, target_language (e.g. ['en', 'ru'])
A list with source language and target_language (e.g., ['en', 'ru']).
src_vocab_size (:obj:`int`):
defines the different tokens that can be represented by `inputs_ids` passed to the forward
method in the encoder.
Vocabulary size of the encoder. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed to the forward method in the encoder.
tgt_vocab_size (:obj:`int`):
defines the different tokens that can be represented by `inputs_ids` passed to the forward
method in the decoder.
Vocabulary size of the decoder. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed to the forward method in the decoder.
d_model (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the layers and the pooler layer.
encoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of encoder layers, 16 for pegasus, 6 for bart-base and marian
Number of encoder layers.
decoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of decoder layers, 16 for pegasus, 6 for bart-base and marian
Number of decoder layers.
encoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer decoder.
decoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder.
Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to "relu"):
Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
activation_function (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"relu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported.
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
......@@ -74,7 +94,7 @@ FSMT_CONFIG_ARGS_DOC = r"""
eos_token_id (:obj:`int`, `optional`, defaults to 2):
End of stream token id.
decoder_start_token_id (:obj:`int`, `optional`):
This model starts decoding with :obj:`eos_token_id`.
encoder_layerdrop (:obj:`float`, `optional`, defaults to 0.0):
Google "layerdrop arxiv", as it's not explainable in one line.
decoder_layerdrop (:obj:`float`, `optional`, defaults to 0.0):
......@@ -92,26 +112,14 @@ FSMT_CONFIG_ARGS_DOC = r"""
early_stopping (:obj:`bool`, `optional`, defaults to :obj:`False`):
Flag that will be used by default in the :obj:`generate` method of the model. Whether to stop
the beam search when at least ``num_beams`` sentences are finished per batch or not.
"""
Examples::
class DecoderConfig(PretrainedConfig):
r"""
Configuration class for FSMT's decoder specific things.
note: this is a private helper class
"""
model_type = "fsmt_decoder"
def __init__(self, vocab_size=0, bos_token_id=0):
super().__init__()
self.vocab_size = vocab_size
self.bos_token_id = bos_token_id
>>> from transformers import FSMTConfig, FSMTModel
>>> config = FSMTConfig.from_pretrained('facebook/wmt19-en-ru')
>>> model = FSMTModel(config)
@add_start_docstrings_to_callable(FSMT_CONFIG_ARGS_DOC)
class FSMTConfig(PretrainedConfig):
r"""
Configuration class for FSMT.
"""
model_type = "fsmt"
......@@ -149,17 +157,6 @@ class FSMTConfig(PretrainedConfig):
early_stopping=False,
**common_kwargs
):
r"""
:class:`~transformers.FSMTConfig` is the configuration class for `FSMTModel`.
Examples::
>>> from transformers import FSMTConfig, FSMTModel
>>> config = FSMTConfig.from_pretrained('facebook/wmt19-en-ru')
>>> model = FSMTModel(config)
"""
if "hidden_size" in common_kwargs:
raise ValueError("hidden size is called d_model")
super().__init__(
......
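In addition to the ``from_pretrained`` example shown in the docstring, a configuration can be built by hand from the arguments documented above; the values below are purely illustrative (tiny vocabularies, not a usable translation model), assuming the library as of this commit::

    >>> from transformers import FSMTConfig, FSMTModel
    >>> # langs, src_vocab_size and tgt_vocab_size are the arguments described in FSMT_CONFIG_ARGS_DOC
    >>> config = FSMTConfig(langs=["en", "ru"], src_vocab_size=1000, tgt_vocab_size=1000)
    >>> # Randomly initialized model built from that configuration
    >>> model = FSMTModel(config)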
......@@ -36,20 +36,21 @@ FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class FunnelConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.FunnelModel`.
It is used to instantiate an Funnel Transformer model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the Funnel Transformer `funnel-transformer/small <https://huggingface.co/funnel-transformer/small>`__ architecture.
This is the configuration class to store the configuration of a :class:`~transformers.FunnelModel` or a
:class:`~transformers.TFFunnelModel`. It is used to instantiate a Funnel Transformer model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the Funnel Transformer `funnel-transformer/small
<https://huggingface.co/funnel-transformer/small>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the Funnel transformer. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.FunnelModel`.
Vocabulary size of the Funnel transformer. Defines the number of different tokens that can be represented
by the :obj:`inputs_ids` passed when calling :class:`~transformers.FunnelModel` or
:class:`~transformers.TFFunnelModel`.
block_sizes (:obj:`List[int]`, `optional`, defaults to :obj:`[4, 4, 4]`):
The sizes of the blocks used in the model.
block_repeats (:obj:`List[int]`, `optional`):
......@@ -77,7 +78,8 @@ class FunnelConfig(PretrainedConfig):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, `optional`, defaults to 3):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.FunnelModel`.
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.FunnelModel` or
:class:`~transformers.TFFunnelModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.1):
The standard deviation of the `uniform initializer` for initializing all weight matrices in attention
layers.
......
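A minimal sketch for the Funnel Transformer configuration (assuming the library as of this commit; ``block_sizes`` below is an illustrative override of the documented default ``[4, 4, 4]``)::

    >>> from transformers import FunnelConfig, FunnelModel
    >>> # A smaller model than the defaults, purely for illustration
    >>> config = FunnelConfig(block_sizes=[2, 2, 2])
    >>> model = FunnelModel(config)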
......@@ -32,10 +32,10 @@ GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class GPT2Config(PretrainedConfig):
"""
This is the configuration class to store the configuration of a :class:`~transformers.GPT2Model`.
It is used to instantiate an GPT-2 model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the GPT-2 `small <https://huggingface.co/gpt2>`__ architecture.
This is the configuration class to store the configuration of a :class:`~transformers.GPT2Model` or a
:class:`~transformers.TFGPT2Model`. It is used to instantiate a GPT-2 model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the GPT-2 `small <https://huggingface.co/gpt2>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
......@@ -43,60 +43,66 @@ class GPT2Config(PretrainedConfig):
Args:
vocab_size (:obj:`int`, optional, defaults to 50257):
Vocabulary size of the GPT-2 model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.GPT2Model`.
n_positions (:obj:`int`, optional, defaults to 1024):
vocab_size (:obj:`int`, `optional`, defaults to 50257):
Vocabulary size of the GPT-2 model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.GPT2Model` or
:class:`~transformers.TFGPT2Model`.
n_positions (:obj:`int`, `optional`, defaults to 1024):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
n_ctx (:obj:`int`, optional, defaults to 1024):
n_ctx (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the causal mask (usually same as n_positions).
n_embd (:obj:`int`, optional, defaults to 768):
n_embd (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the embeddings and hidden states.
n_layer (:obj:`int`, optional, defaults to 12):
n_layer (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder.
n_head (:obj:`int`, optional, defaults to 12):
n_head (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
n_inner (:obj:`int`, optional, defaults to None):
n_inner (:obj:`int`, `optional`, defaults to None):
Dimensionality of the inner feed-forward layers. :obj:`None` will set it to 4 times :obj:`n_embd`.
activation_function (:obj:`str`, optional, defaults to 'gelu'):
Activation function selected in the list ["relu", "swish", "gelu", "tanh", "gelu_new"].
resid_pdrop (:obj:`float`, optional, defaults to 0.1):
activation_function (:obj:`str`, `optional`, defaults to :obj:`"gelu"`):
Activation function, to be selected in the list :obj:`["relu", "swish", "gelu", "tanh", "gelu_new"]`.
resid_pdrop (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
embd_pdrop (:obj:`int`, optional, defaults to 0.1):
embd_pdrop (:obj:`int`, `optional`, defaults to 0.1):
The dropout ratio for the embeddings.
attn_pdrop (:obj:`float`, optional, defaults to 0.1):
attn_pdrop (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention.
layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):
layer_norm_epsilon (:obj:`float`, `optional`, defaults to 1e-5):
The epsilon to use in the layer normalization layers
initializer_range (:obj:`float`, optional, defaults to 0.02):
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
summary_type (:obj:`string`, optional, defaults to "cls_index"):
summary_type (:obj:`str`, `optional`, defaults to :obj:`"cls_index"`):
Argument used when doing sequence summary, used in the models
:class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
Has to be one of the following options:
- :obj:`"last"`: Take the last token hidden state (like XLNet).
- :obj:`"first"`: Take the first token hidden state (like BERT).
- :obj:`"mean"`: Take the mean of all tokens hidden states.
- :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- :obj:`"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary, used in the models
:class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
Whether or not to add a projection after the vector extraction.
summary_activation (:obj:`str`, `optional`):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.GPT2DoubleHeadsModel`.
Is one of the following options:
- 'last' => take the last token hidden state (like XLNet)
- 'first' => take the first token hidden state (like Bert)
- 'mean' => take the mean of all tokens hidden states
- 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)
- 'attn' => Not implemented now, use multi-head attention
summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.GPT2DoubleHeadsModel`.
Add a projection after the vector extraction
summary_activation (:obj:`string` or :obj:`None`, optional):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.GPT2DoubleHeadsModel`.
'tanh' => add a tanh activation to the output, Other => no activation.
summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.GPT2DoubleHeadsModel`.
If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.
summary_first_dropout (:obj:`float`, optional, defaults to 0.1):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.GPT2DoubleHeadsModel`.
Add a dropout before the projection and activation
Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary, used in the models
:class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes.
summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1):
Argument used when doing sequence summary, used in the models
:class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
The dropout ratio to be used after the projection and activation.
Example::
......
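A minimal sketch of the GPT-2 configuration in use (assuming the library as of this commit; the small ``n_layer``/``n_head``/``n_embd`` values are illustrative and not the documented defaults, which reproduce the gpt2 "small" architecture)::

    >>> from transformers import GPT2Config, GPT2Model
    >>> # A deliberately tiny, randomly initialized model for illustration
    >>> config = GPT2Config(n_layer=2, n_head=2, n_embd=128)
    >>> model = GPT2Model(config)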
......@@ -33,6 +33,10 @@ LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class LongformerConfig(RobertaConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.LongformerModel` or a
:class:`~transformers.TFLongformerModel`. It is used to instantiate a Longformer model according to the specified
arguments, defining the model architecture.
This is the configuration class to store the configuration of a :class:`~transformers.LongformerModel`.
It is used to instantiate an Longformer model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
......@@ -42,8 +46,8 @@ class LongformerConfig(RobertaConfig):
It reuses the same defaults. Please check the parent class for more information.
Args:
attention_window (:obj:`int` or :obj:`List[int]`, optional, defaults to 512):
Size of an attention window around each token. If :obj:`int`, use the same size for all layers.
attention_window (:obj:`int` or :obj:`List[int]`, `optional`, defaults to 512):
Size of an attention window around each token. If an :obj:`int`, use the same size for all layers.
To specify a different window size for each layer, use a :obj:`List[int]` where
``len(attention_window) == num_hidden_layers``.
......
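A minimal sketch for the ``attention_window`` argument described above (assuming the library as of this commit; 256 is an illustrative value, the documented default is 512)::

    >>> from transformers import LongformerConfig, LongformerModel
    >>> # A single int applies the same window size to every layer
    >>> config = LongformerConfig(attention_window=256)
    >>> model = LongformerModel(config)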
......@@ -29,83 +29,91 @@ LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class LxmertConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.BertModel`.
It is used to instantiate an Lxmert model according to the specified arguments, defining the model
architecture.
This is the configuration class to store the configuration of a :class:`~transformers.LxmertModel` or a
:class:`~transformers.TFLxmertModel`. It is used to instantiate a LXMERT model according to the specified
arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, optional, defaults to 30522):
Vocabulary size of the BERT model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertModel`.
hidden_size (:obj:`int`, optional, defaults to 768):
vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the LXMERT model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.LxmertModel` or
:class:`~transformers.TFLxmertModel`.
hidden_size (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
r_layers (:obj:`int`, optional, defaults to 5):
r_layers (:obj:`int`, `optional`, defaults to 5):
Number of hidden layers in the Transformer visual encoder.
l_layers (:obj:`int`, optional, defaults to 9):
l_layers (:obj:`int`, `optional`, defaults to 9):
Number of hidden layers in the Transformer language encoder.
x_layers (:obj:`int`, optional, defaults to 5):
x_layers (:obj:`int`, `optional`, defaults to 5):
Number of hidden layers in the Transformer cross modality encoder.
num_attention_heads (:obj:`int`, optional, defaults to 5):
num_attention_heads (:obj:`int`, `optional`, defaults to 5):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, optional, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported.
hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, optional, defaults to 512):
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, optional, defaults to 2):
type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.BertModel`.
initializer_range (:obj:`float`, optional, defaults to 0.02):
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
visual_feat_dim (:obj:`int`, optional, defaults to 2048):
visual_feat_dim (:obj:`int`, `optional`, defaults to 2048):
This represents the last dimension of the pooled-object features used as input for the model,
representing the size of each object feature itself.
visual_pos_dim (:obj:`int`, optional, defaults to 4):
visual_pos_dim (:obj:`int`, `optional`, defaults to 4):
This represents the number of spatial features that are mixed into the visual features.
The default is set to 4 because most commonly this will represent the location of a bounding box.
i.e. (x, y, width, height)
visual_loss_normalizer (:obj:`float`, optional, defaults to 1/15):
i.e., (x, y, width, height)
visual_loss_normalizer (:obj:`float`, `optional`, defaults to 1/15):
This represents the scaling factor by which each visual loss is multiplied if, during pretraining,
one decided to train with multiple vision-based loss objectives.
num_qa_labels (:obj:`int`, optional, defaults to 9500):
This represents the total number of different question answering (QA) labels there are. If using more than one dataset with QA,
the user will need to account for the total number of labels that all of the datasets have in total.
num_object_labels (:obj:`int`, optional, defaults to 1600):
This represents the total number of semantically unique objects that lxmert will be able to classify a pooled-object feature
as belonging too.
num_attr_labels (:obj:`int`, optional, defaults to 400):
This represents the total number of semantically unique attributes that lxmert will be able to classify a pooled-object feature
as possessing.
task_matched (:obj:`bool`, optional, defaults to :obj:`True`):
This task is used for sentence-image matching. If the sentence correctly describes the image the label will be 1.
If the sentence does not correctly describe the image, the label will be 0.
task_mask_lm (:obj:`bool`, optional, defaults to :obj:`True`):
This task is the defacto masked langauge modeling used in pretraining models such as BERT.
task_obj_predict (:obj:`bool`, optional, defaults to :obj:`True`):
This task is set to true if the user would like to perform one of the following loss objectives:
object predicition, atrribute predicition, feature regression
task_qa (:obj:`bool`, optional, defaults to :obj:`True`):
This task specifies whether or not Lxmert will calculate the question-asnwering loss objective
visual_obj_loss (:obj:`bool`, optional, defaults to :obj:`True`):
This task specifies whether or not Lxmert will calculate the object-prediction loss objective
visual_attr_loss (:obj:`bool`, optional, defaults to :obj:`True`):
This task specifies whether or not Lxmert will calculate the attribute-prediction loss objective
visual_feat_loss (:obj:`bool`, optional, defaults to :obj:`True`):
This task specifies whether or not Lxmert will calculate the feature-regression loss objective
output_attentions (:obj:`bool`, optional, defaults to :obj:`False`):
if True, the vision, langauge, and cross-modality layers will be returned
output_hidden_states (:obj:`bool`, optional, defaults to :obj:`False`):
if True, final cross-modality hidden states for language and vision features will be returned
num_qa_labels (:obj:`int`, `optional`, defaults to 9500):
This represents the total number of different question answering (QA) labels there are. If using more than
one dataset with QA, the user will need to account for the total number of labels that all of the datasets
have in total.
num_object_labels (:obj:`int`, `optional`, defaults to 1600):
This represents the total number of semantically unique objects that LXMERT will be able to classify a
pooled-object feature as belonging to.
num_attr_labels (:obj:`int`, `optional`, defaults to 400):
This represents the total number of semantically unique attributes that LXMERT will be able to classify a
pooled-object feature as possessing.
task_matched (:obj:`bool`, `optional`, defaults to :obj:`True`):
This task is used for sentence-image matching. If the sentence correctly describes the image the label
will be 1. If the sentence does not correctly describe the image, the label will be 0.
task_mask_lm (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to add masked language modeling (as used in pretraining models such as BERT) to the loss
objective.
task_obj_predict (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to add object prediction, attribute prediction and feature regression to the loss
objective.
task_qa (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to add the question-answering loss to the objective.
visual_obj_loss (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to calculate the object-prediction loss objective.
visual_attr_loss (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to calculate the attribute-prediction loss objective.
visual_feat_loss (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to calculate the feature-regression loss objective.
output_attentions (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the model should return the attentions from the vision, language, and cross-modality
layers.
output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the model should return the hidden states from the vision, language, and cross-modality
layers.
"""
model_type = "lxmert"
......
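A minimal sketch for the LXMERT configuration (assuming the library as of this commit; defaults only, random weights)::

    >>> from transformers import LxmertConfig, LxmertModel
    >>> configuration = LxmertConfig()
    >>> model = LxmertModel(configuration)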
......@@ -22,15 +22,16 @@ logger = logging.get_logger(__name__)
class MMBTConfig(object):
"""Configuration class to store the configuration of a `MMBT Model`.
"""
This is the configuration class to store the configuration of a :class:`~transformers.MMBTModel`. It is used to
instantiate a MMBT model according to the specified arguments, defining the model architecture.
Args:
config (:obj:`~transformers.PreTrainedConfig`):
Config of the underlying Transformer models. Its values are
copied over to use a single config.
num_labels (:obj:`int` or :obj:`None`, optional, defaults to `None`):
config (:class:`~transformers.PreTrainedConfig`):
Config of the underlying Transformer models. Its values are copied over to use a single config.
num_labels (:obj:`int`, `optional`):
Size of final Linear layer for classification.
modal_hidden_size (:obj:`int`, optional, defautls to 2048):
modal_hidden_size (:obj:`int`, `optional`, defaults to 2048):
Embedding dimension of the non-text modality encoder.
"""
......
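Because :class:`~transformers.MMBTConfig` wraps the configuration of an underlying text Transformer rather than defining one itself, a minimal sketch looks a little different (assuming the library as of this commit; building a full MMBT model additionally requires the text transformer and a modal encoder module, which are omitted here)::

    >>> from transformers import BertConfig, MMBTConfig
    >>> # The wrapped config supplies most hyper-parameters; num_labels sizes the final classification layer
    >>> transformer_config = BertConfig()
    >>> config = MMBTConfig(transformer_config, num_labels=2)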
......@@ -25,9 +25,9 @@ MOBILEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class MobileBertConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.MobileBertModel`.
It is used to instantiate a MobileBERT model according to the specified arguments, defining the model
architecture.
This is the configuration class to store the configuration of a :class:`~transformers.MobileBertModel` or a
:class:`~transformers.TFMobileBertModel`. It is used to instantiate a MobileBERT model according to the specified
arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
......@@ -35,54 +35,56 @@ class MobileBertConfig(PretrainedConfig):
Args:
vocab_size (:obj:`int`, optional, defaults to 30522):
Vocabulary size of the MobileBERT model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.MobileBertModel`.
hidden_size (:obj:`int`, optional, defaults to 512):
vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the MobileBERT model. Defines the number of different tokens that can be represented by
the :obj:`inputs_ids` passed when calling :class:`~transformers.MobileBertModel` or
:class:`~transformers.TFMobileBertModel`.
hidden_size (:obj:`int`, `optional`, defaults to 512):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, optional, defaults to 24):
num_hidden_layers (:obj:`int`, `optional`, defaults to 24):
Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, optional, defaults to 4):
num_attention_heads (:obj:`int`, `optional`, defaults to 4):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, optional, defaults to 512):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "relu"):
intermediate_size (:obj:`int`, `optional`, defaults to 512):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"relu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported.
hidden_dropout_prob (:obj:`float`, optional, defaults to 0.0):
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.0):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, optional, defaults to 512):
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, optional, defaults to 2):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.MobileBertModel`.
initializer_range (:obj:`float`, optional, defaults to 0.02):
type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.MobileBertModel`
or :class:`~transformers.TFMobileBertModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
pad_token_id (:obj:`int`, optional, defaults to 0):
pad_token_id (:obj:`int`, `optional`, defaults to 0):
The ID of the token in the word embedding to use as padding.
embedding_size (:obj:`int`, optional, defaults to 128):
embedding_size (:obj:`int`, `optional`, defaults to 128):
The dimension of the word embedding vectors.
trigram_input (:obj:`bool`, optional, defaults to :obj:`True`):
trigram_input (:obj:`bool`, `optional`, defaults to :obj:`True`):
Use a convolution of trigram as input.
use_bottleneck (:obj:`bool`, optional, defaults to :obj:`True`):
use_bottleneck (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to use bottleneck in BERT.
intra_bottleneck_size (:obj:`int`, optional, defaults to 128):
intra_bottleneck_size (:obj:`int`, `optional`, defaults to 128):
Size of bottleneck layer output.
use_bottleneck_attention (:obj:`bool`, optional, defaults to :obj:`False`):
use_bottleneck_attention (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to use attention inputs from the bottleneck transformation.
key_query_shared_bottleneck (:obj:`bool`, optional, defaults to :obj:`True`):
key_query_shared_bottleneck (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to use the same linear transformation for query&key in the bottleneck.
num_feedforward_networks (:obj:`int`, optional, defaults to 4):
num_feedforward_networks (:obj:`int`, `optional`, defaults to 4):
Number of FFNs in a block.
normalization_type (:obj:`str`, optional, defaults to "no_norm"):
The normalization type in BERT.
normalization_type (:obj:`str`, `optional`, defaults to :obj:`"no_norm"`):
The normalization type in MobileBERT.
Example:
Examples:
>>> from transformers import MobileBertModel, MobileBertConfig
......
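To complement the truncated example above, a minimal sketch for the MobileBERT configuration (assuming the library as of this commit; defaults only, random weights)::

    >>> from transformers import MobileBertConfig, MobileBertModel
    >>> configuration = MobileBertConfig()
    >>> model = MobileBertModel(configuration)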
......@@ -28,73 +28,79 @@ OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class OpenAIGPTConfig(PretrainedConfig):
"""
This is the configuration class to store the configuration of a :class:`~transformers.OpenAIGPTModel`.
It is used to instantiate an GPT model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the `GPT <https://huggingface.co/openai-gpt>`__ architecture from OpenAI.
This is the configuration class to store the configuration of a :class:`~transformers.OpenAIGPTModel` or a
:class:`~transformers.TFOpenAIGPTModel`. It is used to instantiate a GPT model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the `GPT <https://huggingface.co/openai-gpt>`__ architecture from OpenAI.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, optional, defaults to 40478):
Vocabulary size of the GPT model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.CTRLModel`.
n_positions (:obj:`int`, optional, defaults to 512):
vocab_size (:obj:`int`, `optional`, defaults to 40478):
Vocabulary size of the GPT model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.OpenAIGPTModel` or
:class:`~transformers.TFOpenAIGPTModel`.
n_positions (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
n_ctx (:obj:`int`, optional, defaults to 512):
n_ctx (:obj:`int`, `optional`, defaults to 512):
Dimensionality of the causal mask (usually same as n_positions).
n_embd (:obj:`int`, optional, defaults to 768):
n_embd (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the embeddings and hidden states.
n_layer (:obj:`int`, optional, defaults to 12):
n_layer (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder.
n_head (:obj:`int`, optional, defaults to 12):
n_head (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
afn (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
afn (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported.
resid_pdrop (:obj:`float`, optional, defaults to 0.1):
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
resid_pdrop (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
embd_pdrop (:obj:`int`, optional, defaults to 0.1):
embd_pdrop (:obj:`int`, `optional`, defaults to 0.1):
The dropout ratio for the embeddings.
attn_pdrop (:obj:`float`, optional, defaults to 0.1):
attn_pdrop (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention.
layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):
layer_norm_epsilon (:obj:`float`, `optional`, defaults to 1e-5):
The epsilon to use in the layer normalization layers
initializer_range (:obj:`float`, optional, defaults to 0.02):
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
predict_special_tokens (:obj:`boolean`, optional, defaults to :obj:`True`):
Whether special tokens should be predicted when the model is has a language modeling head.
summary_type (:obj:`string`, optional, defaults to "cls_index"):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.OpenAIGPTDoubleHeadsModel`.
Is one of the following options:
- 'last' => take the last token hidden state (like XLNet)
- 'first' => take the first token hidden state (like Bert)
- 'mean' => take the mean of all tokens hidden states
- 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)
- 'attn' => Not implemented now, use multi-head attention
summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.OpenAIGPTDoubleHeadsModel`.
Add a projection after the vector extraction
summary_activation (:obj:`string` or :obj:`None`, optional):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.OpenAIGPTDoubleHeadsModel`.
'tanh' => add a tanh activation to the output, Other => no activation.
summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.OpenAIGPTDoubleHeadsModel`.
If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.
summary_first_dropout (:obj:`float`, optional, defaults to 0.1):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.OpenAIGPTDoubleHeadsModel`.
Add a dropout before the projection and activation
Example::
predict_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not special tokens should be predicted when the model has a language modeling head.
summary_type (:obj:`str`, `optional`, defaults to :obj:`"cls_index"`):
Argument used when doing sequence summary, used in the models
:class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.
Has to be one of the following options:
- :obj:`"last"`: Take the last token hidden state (like XLNet).
- :obj:`"first"`: Take the first token hidden state (like BERT).
- :obj:`"mean"`: Take the mean of all tokens hidden states.
- :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- :obj:`"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary, used in the models
:class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.
Whether or not to add a projection after the vector extraction.
summary_activation (:obj:`str`, `optional`):
Argument used when doing sequence summary, used in the models
:class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.
Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary, used in the models
:class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.
Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes.
summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1):
Argument used when doing sequence summary, used in the models
:class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.
The dropout ratio to be used after the projection and activation.
Examples::
>>> from transformers import OpenAIGPTConfig, OpenAIGPTModel
......
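A minimal illustrative sketch, assuming the standard keyword arguments of :class:`~transformers.OpenAIGPTConfig`, of how the sequence-summary options documented above can be combined for the multiple choice head::

>>> from transformers import OpenAIGPTConfig, OpenAIGPTDoubleHeadsModel

>>> # Pool the hidden state at the classification token position, apply dropout,
>>> # then project it through a linear layer (no activation).
>>> config = OpenAIGPTConfig(
...     summary_type="cls_index",
...     summary_use_proj=True,
...     summary_activation=None,       # any value other than "tanh" means no activation
...     summary_proj_to_labels=True,   # project to config.num_labels outputs
...     summary_first_dropout=0.1,
...     num_labels=1,                  # assumption for illustration: one score per choice
... )

>>> # Initializing a double-heads model (with random weights) from this configuration
>>> model = OpenAIGPTDoubleHeadsModel(config)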
......
@@ -29,96 +29,120 @@ REFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class ReformerConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.ReformerModel`.
It is used to instantiate a Reformer model according to the specified arguments, defining the model
architecture.
This is the configuration class to store the configuration of a :class:`~transformers.ReformerModel`. It is used to
instantiate a Reformer model according to the specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
attention_head_size (:obj:`int`, optional, defaults to 64):
attention_head_size (:obj:`int`, `optional`, defaults to 64):
Dimensionality of the projected key, query and value vectors.
attn_layers (:obj:`list(str)`, optional, defaults to ["local", "lsh", "local", "lsh", "local", "lsh"]):
attn_layers (:obj:`List[str]`, `optional`, defaults to :obj:`["local", "lsh", "local", "lsh", "local", "lsh"]`):
List of attention layer types in ascending order. It can be chosen between a
LSHSelfAttention layer ("lsh") and a LocalSelfAttention layer ("local").
For more information on LSHSelfAttention layer, see `LSH Self Attention <reformer.html#lsh-self-attention>`__ .
For more information on LocalSelfAttention layer, see `Local Self Attention <reformer.html#local-sensitive-hashing-self-attention>`__ .
axial_pos_embds (:obj:`bool`, optional, defaults to :obj:`True`):
If `True`, use axial position embeddings. For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.
axial_norm_std (:obj:`float`, optional, defaults to 1.0):
The standard deviation of the normal_initializer for initializing the weight matrices of the axial positional encodings.
axial_pos_shape (:obj:`list(int)`, optional, defaults to `[64, 64]`):
The position dims of the axial position encodings.
During training the product of the position dims has to equal the sequence length.
For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.
axial_pos_embds_dim (:obj:`list(int)`, optional, defaults to `[64, 192]`):
The embedding dims of the axial position encodings.
The sum of the embedding dims has to equal the hidden size.
For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.
chunk_size_lm_head (:obj:`int`, optional, defaults to 0):
LSHSelfAttention layer (:obj:`"lsh"`) and a LocalSelfAttention layer (:obj:`"local"`).
For more information on LSHSelfAttention layer, see `LSH Self Attention
<reformer.html#lsh-self-attention>`__. For more information on LocalSelfAttention layer, see `Local Self
Attention <reformer.html#local-sensitive-hashing-self-attention>`__.
axial_pos_embds (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to use axial position embeddings. For more information on how axial position embeddings
work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.
axial_norm_std (:obj:`float`, `optional`, defaults to 1.0):
The standard deviation of the normal_initializer for initializing the weight matrices of the axial
positional encodings.
axial_pos_shape (:obj:`List[int]`, `optional`, defaults to :obj:`[64, 64]`):
The position dims of the axial position encodings. During training the product of the position dims has to
be equal to the sequence length.
For more information on how axial position embeddings work, see `Axial Position Encodings
<reformer.html#axial-positional-encodings>`__.
axial_pos_embds_dim (:obj:`List[int]`, `optional`, defaults to :obj:`[64, 192]`):
The embedding dims of the axial position encodings. The sum of the embedding dims has to be equal to the
hidden size.
For more information on how axial position embeddings work, see `Axial Position Encodings
<reformer.html#axial-positional-encodings>`__.
chunk_size_lm_head (:obj:`int`, `optional`, defaults to 0):
The chunk size of the final language model feed forward head layer.
A chunk size of 0 means that the feed forward layer is not chunked.
A chunk size of n means that the feed forward layer processes n < sequence_length embeddings at a time.
For more information on feed forward chunking, see `How does Feed Forward Chunking work? <../glossary.html#feed-forward-chunking>`__ .
eos_token_id (:obj:`int`, optional, defaults to 2):
The token id for the <EOS> token.
feed_forward_size (:obj:`int`, optional, defaults to 512):
Dimensionality of the "feed_forward" (i.e., feed-forward) layer in the residual attention block.
hash_seed (:obj:`int`, optional, defaults to `None`):
Seed that can be used to make local sensitive hashing in LSHSelfAttention deterministic. This should only be set for testing purposes. For evaluation and training purposes `hash_seed` should be set to `None` to ensure fully random rotations in the local sensitive hashing scheme.
hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "relu"):
The non-linear activation function (function or string) in the feed forward layer in the residual attention block.
If string, "gelu", "relu", "swish", "gelu_new" and "gelu_fast" are supported.
hidden_dropout_prob (:obj:`float`, optional, defaults to 0.05):
For more information on feed forward chunking, see `How does Feed Forward Chunking work?
<../glossary.html#feed-forward-chunking>`__.
eos_token_id (:obj:`int`, `optional`, defaults to 2):
The token id for the end-of-sentence token.
feed_forward_size (:obj:`int`, `optional`, defaults to 512):
Dimensionality of the feed_forward layer in the residual attention block.
hash_seed (:obj:`int`, `optional`):
Seed that can be used to make local sensitive hashing in :obj:`LSHSelfAttention` deterministic. This should
only be set for testing purposes. For evaluation and training purposes :obj:`hash_seed` should be left as
:obj:`None` to ensure fully random rotations in the local sensitive hashing scheme.
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"relu"`):
The non-linear activation function (function or string) in the feed forward layer in the residual attention
block.
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.05):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
hidden_size (:obj:`int`, optional, defaults to 256):
hidden_size (:obj:`int`, `optional`, defaults to 256):
Dimensionality of the output hidden states of the residual attention blocks.
initializer_range (:obj:`float`, optional, defaults to 0.02):
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
is_decoder (:obj:`bool`, optional, defaults to :obj:`False`):
If `is_decoder` is True, a causal mask is used in addition to `attention_mask`.
When using the Reformer for causal language modeling, `is_decoder` is set to `True`.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
is_decoder (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to use a causal mask in addition to the :obj:`attention_mask` passed to
:class:`~transformers.ReformerModel`. When using the Reformer for causal language modeling, this argument
should be set to :obj:`True`.
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
local_chunk_length (:obj:`int`, optional, defaults to 64):
Length of chunk which attends to itself in LocalSelfAttention. Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention).
local_num_chunks_before (:obj:`int`, optional, defaults to 1):
Number of previous neighbouring chunks to attend to in LocalSelfAttention layer in addition to itself.
local_num_chunks_after (:obj:`int`, optional, defaults to 0):
Number of following neighbouring chunks to attend to in LocalSelfAttention layer in addition to itself.
local_attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
The dropout ratio for the attention probabilities in LocalSelfAttention.
lsh_attn_chunk_length (:obj:`int`, optional, defaults to 64):
Length of chunk which attends to itself in LSHSelfAttention. Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention).
lsh_num_chunks_before (:obj:`int`, optional, defaults to 1):
Number of previous neighbouring chunks to attend to in LSHSelfAttention layer in addition to itself.
lsh_num_chunks_after (:obj:`int`, optional, defaults to 0):
Number of following neighbouring chunks to attend to in LSHSelfAttention layer in addition to itself.
lsh_attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
The dropout ratio for the attention probabilities in LSHSelfAttention.
max_position_embeddings (:obj:`int`, optional, defaults to 4096):
local_chunk_length (:obj:`int`, `optional`, defaults to 64):
Length of chunk which attends to itself in :obj:`LocalSelfAttention`. Chunking reduces memory complexity
from sequence length x sequence length (self attention) to
chunk length x chunk length x sequence length / chunk length (chunked self attention).
local_num_chunks_before (:obj:`int`, `optional`, defaults to 1):
Number of previous neighbouring chunks to attend to in :obj:`LocalSelfAttention` layer in addition to itself.
local_num_chunks_after (:obj:`int`, `optional`, defaults to 0):
Number of following neighbouring chunks to attend to in :obj:`LocalSelfAttention` layer in addition to
itself.
local_attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities in :obj:`LocalSelfAttention`.
lsh_attn_chunk_length (:obj:`int`, `optional`, defaults to 64):
Length of chunk which attends to itself in :obj:`LSHSelfAttention`. Chunking reduces memory complexity from
sequence length x sequence length (self attention) to
chunk length x chunk length x sequence length / chunk length (chunked self attention).
lsh_num_chunks_before (:obj:`int`, `optional`, defaults to 1):
Number of previous neighbouring chunks to attend to in :obj:`LSHSelfAttention` layer in addition to itself.
lsh_num_chunks_after (:obj:`int`, `optional`, defaults to 0):
Number of following neighbouring chunks to attend to in :obj:`LSHSelfAttention` layer in addition to itself.
lsh_attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities in :obj:`LSHSelfAttention`.
max_position_embeddings (:obj:`int`, `optional`, defaults to 4096):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
num_attention_heads (:obj:`int`, optional, defaults to 12):
num_attention_heads (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
num_buckets (:obj:`int` or :obj:`list(int)`, optional, defaults to `None`):
Number of buckets, the key query vectors can be "hashed into" using the locality sensitive hashing scheme. Each query key vector is hashed into a hash in `1, ..., num_buckets`.
The number of buckets can also be factorized into a list for improved memory complexity. In this case, each query key vector is hashed into a hash in `1-1, 1-2, ..., num_buckets[0]-1, ..., num_buckets[0]-num_buckets[1]` if `num_buckets` is factorized into two factors.
The number of buckets (or the product of the factors) should approximately equal sequence length / lsh_chunk_length. If `num_buckets` is set to `None`, a good value for `num_buckets` is calculated on the fly.
num_hashes (:obj:`int`, optional, defaults to 1):
Number of hashing rounds (e.g. number of random rotations) in Local Sensitive Hashing scheme.
The higher `num_hashes`, the more accurate the `LSHSelfAttention` becomes, but also the more memory and time intensive the hashing becomes.
pad_token_id (:obj:`int`, optional, defaults to 0):
The token id for the <PAD> token.
vocab_size (:obj:`int`, optional, defaults to 320):
Vocabulary size of the Reformer model. Defines the number of different tokens that
can be represented by the `input_ids` passed to the forward method of :class:`~transformers.ReformerModel`.
num_buckets (:obj:`int` or :obj:`List[int]`, `optional`):
Number of buckets the key query vectors can be "hashed into" using the locality sensitive hashing scheme.
Each query key vector is hashed into a hash in :obj:`1, ..., num_buckets`.
The number of buckets can also be factorized into a list for improved memory complexity. In this case, each
query key vector is hashed into a hash in
:obj:`1-1, 1-2, ..., num_buckets[0]-1, ..., num_buckets[0]-num_buckets[1]` if :obj:`num_buckets` is
factorized into two factors.
The number of buckets (or the product of the factors) should approximately equal
sequence length / lsh_chunk_length. If :obj:`num_buckets` is not set, a good value is calculated on the fly.
num_hashes (:obj:`int`, `optional`, defaults to 1):
Number of hashing rounds (e.g., number of random rotations) in Local Sensitive Hashing scheme.
The higher :obj:`num_hashes`, the more accurate the :obj:`LSHSelfAttention` becomes, but also the more
memory and time intensive the hashing becomes.
pad_token_id (:obj:`int`, `optional`, defaults to 0):
The token id for the padding token.
vocab_size (:obj:`int`, `optional`, defaults to 320):
Vocabulary size of the Reformer model. Defines the number of different tokens that can be represented by the
:obj:`input_ids` passed when calling :class:`~transformers.ReformerModel`.
tie_word_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to tie input and output embeddings.
Example::
Examples::
>>> from transformers import ReformerModel, ReformerConfig
......
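A minimal illustrative sketch, assuming the standard keyword arguments of :class:`~transformers.ReformerConfig`, of how the constraints documented above fit together::

>>> from transformers import ReformerConfig, ReformerModel

>>> # axial_pos_shape: the product of the dims (64 * 64 = 4096) has to equal the training sequence length.
>>> # axial_pos_embds_dim: the sum of the dims (64 + 192 = 256) has to equal hidden_size.
>>> # num_buckets: roughly sequence length / lsh_attn_chunk_length = 4096 / 64 = 64.
>>> config = ReformerConfig(
...     hidden_size=256,
...     max_position_embeddings=4096,
...     axial_pos_embds=True,
...     axial_pos_shape=[64, 64],
...     axial_pos_embds_dim=[64, 192],
...     attn_layers=["local", "lsh", "local", "lsh", "local", "lsh"],
...     lsh_attn_chunk_length=64,
...     num_buckets=64,
... )

>>> # Initializing a model (with random weights) from this configuration
>>> model = ReformerModel(config)

Passing a factorized :obj:`num_buckets` such as :obj:`[8, 8]` instead of a single integer follows the memory-saving factorization described above (the product of the factors still equals 64).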