Unverified Commit 08f534d2 authored by Sylvain Gugger, committed by GitHub

Doc styling (#8067)

* Important files

* Styling them all

* Revert "Styling them all"

This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.

* Styling them for realsies

* Fix syntax error

* Fix benchmark_utils

* More fixes

* Fix modeling auto and script

* Remove new line

* Fixes

* More fixes

* Fix more files

* Style

* Add FSMT

* More fixes

* More fixes

* More fixes

* More fixes

* Fixes

* More fixes

* More fixes

* Last fixes

* Make sphinx happy
parent 04a17f85
@@ -3,7 +3,7 @@ MarianMT

**Bugs:** If you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__
and assign @sshleifer.

Translations should be similar, but not identical to, output in the test set linked to in each model card.

@@ -12,13 +12,14 @@ Implementation Notes

- Each model is about 298 MB on disk, there are more than 1,000 models.
- The list of supported language pairs can be found `here <https://huggingface.co/Helsinki-NLP>`__.
- Models were originally trained by `Jörg Tiedemann
  <https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the `Marian
  <https://marian-nmt.github.io/>`__ C++ library, which supports fast training and translation.
- All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented
  in a model card.
- The 80 opus models that require BPE preprocessing are not supported.
- The modeling code is the same as :class:`~transformers.BartForConditionalGeneration` with a few minor modifications:

  - static (sinusoid) positional embeddings (:obj:`MarianConfig.static_position_embeddings=True`)
  - a new final_logits_bias (:obj:`MarianConfig.add_bias_logits=True`)
  - no layernorm_embedding (:obj:`MarianConfig.normalize_embedding=False`)

@@ -29,17 +30,17 @@ Implementation Notes

Naming
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}` (see the sketch after this list)
- The language codes used to name models are inconsistent. Two digit codes can usually be found `here
  <https://developers.google.com/admin-sdk/directory/v1/languages>`__, three digit codes require googling "language
  code {code}".
- Codes formatted like :obj:`es_AR` are usually :obj:`code_{region}`. That one is Spanish from Argentina.
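For illustration, loading and running one of these checkpoints could look roughly like the sketch below. The
`Helsinki-NLP/opus-mt-en-de` checkpoint is just one instance of the naming scheme picked for the example; any other
language pair follows the same pattern.

.. code-block::

    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-en-de"  # opus-mt-{src}-{tgt} with src=en, tgt=de
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    # encode the source sentence, generate in the target language, decode back to text
    batch = tokenizer.prepare_seq2seq_batch(["I am a small frog."], return_tensors="pt")
    translated = model.generate(**batch)
    print(tokenizer.batch_decode(translated, skip_special_tokens=True))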
Multilingual Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`:

- If :obj:`src` is in all caps, the model supports multiple input languages, you can figure out which ones by
  looking at the model card, or the Group Members `mapping

@@ -112,6 +113,7 @@ Code to see available pretrained models:

MarianConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.MarianConfig
    :members:
@@ -7,9 +7,10 @@ MBart

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The MBart model was presented in `Multilingual Denoising Pre-training for Neural Machine Translation
<https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan
Ghazvininejad, Mike Lewis, Luke Zettlemoyer.

According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual
corpora in many languages using the BART objective. mBART is one of the first methods for pre-training a complete

@@ -21,12 +22,13 @@ The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/ma

Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

MBart is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks. As the model is
multilingual, it expects the sequences in a different format. A special language id token is added in both the source
and target text. The source text format is :obj:`X [eos, src_lang_code]` where :obj:`X` is the source text. The target
text format is :obj:`[tgt_lang_code] X [eos]`. :obj:`bos` is never used.

The :meth:`~transformers.MBartTokenizer.prepare_seq2seq_batch` handles this automatically and should be used to encode
the sequences for sequence-to-sequence fine-tuning.
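As a rough sketch of that batch preparation (the `facebook/mbart-large-en-ro` checkpoint and the returned keys are
assumptions for the example and may differ between versions):

.. code-block::

    from transformers import MBartTokenizer

    tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")
    batch = tokenizer.prepare_seq2seq_batch(
        src_texts=["UN Chief Says There Is No Plan to Stop Chemical Weapons in Syria"],
        src_lang="en_XX",
        tgt_texts=["Şeful ONU declară că nu există o soluţie militară în Siria"],
        tgt_lang="ro_RO",
        return_tensors="pt",
    )
    # the language id tokens are inserted as described above; the batch typically holds
    # input_ids and attention_mask for the source and labels for the target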
- Supervised training

@@ -44,8 +46,8 @@ the sequences for sequence-to-sequence fine-tuning.

- Generation

  While generating the target text, set the :obj:`decoder_start_token_id` to the target language id. The following
  example shows how to translate English to Romanian using the `facebook/mbart-large-en-ro` model.

  .. code-block::
@@ -14,23 +14,23 @@ The abstract from the paper is the following:

*Natural Language Processing (NLP) has recently achieved great success by using huge pre-trained models with hundreds
of millions of parameters. However, these models suffer from heavy model sizes and high latency such that they cannot
be deployed to resource-limited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating
the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to
various downstream NLP tasks via simple fine-tuning. Basically, MobileBERT is a thin version of BERT_LARGE, while
equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks.
To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck incorporated BERT_LARGE
model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Empirical studies show that MobileBERT is
4.3x smaller and 5.5x faster than BERT_BASE while achieving competitive results on well-known benchmarks. On the
natural language inference tasks of GLUE, MobileBERT achieves a GLUE score of 77.7 (0.6 lower than BERT_BASE), and 62 ms
latency on a Pixel 4 phone. On the SQuAD v1.1/v2.0 question answering task, MobileBERT achieves a dev F1 score of
90.0/79.2 (1.5/2.1 higher than BERT_BASE).*

Tips:

- MobileBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
  than the left.
- MobileBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore
  efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained
  with a causal language modeling (CLM) objective are better in that regard. A masked-token prediction sketch follows
  this list.
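A minimal sketch, assuming the `google/mobilebert-uncased` checkpoint:

.. code-block::

    from transformers import MobileBertForMaskedLM, MobileBertTokenizer

    tokenizer = MobileBertTokenizer.from_pretrained("google/mobilebert-uncased")
    model = MobileBertForMaskedLM.from_pretrained("google/mobilebert-uncased")

    inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
    logits = model(**inputs, return_dict=True).logits

    # take the highest-scoring token at the masked position
    mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    print(tokenizer.decode(logits[0, mask_positions].argmax(-1).tolist()))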
The original code can be found `here <https://github.com/google-research/mobilebert>`__.
@@ -9,9 +9,8 @@ and assign @sshleifer.

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Pegasus model was proposed in `PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
<https://arxiv.org/pdf/1912.08777.pdf>`__ by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.

According to the abstract,

@@ -26,7 +25,7 @@ The Authors' code can be found `here <https://github.com/google-research/pegasus

Checkpoints
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All the `checkpoints <https://huggingface.co/models?search=pegasus>`__ are fine-tuned for summarization, besides
`pegasus-large`, from which the other checkpoints are fine-tuned:

- Each checkpoint is 2.2 GB on disk and 568M parameters.

@@ -44,6 +43,7 @@ Implementation Notes

- All models are transformer encoder-decoders with 16 layers in each component.
- The implementation is completely inherited from :class:`~transformers.BartForConditionalGeneration`
- Some key configuration differences:

  - static, sinusoidal position embeddings
  - no :obj:`layernorm_embedding` (:obj:`PegasusConfig.normalize_embedding=False`)
  - the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix.

@@ -84,6 +84,7 @@ PegasusConfig

PegasusTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

warning: ``add_tokens`` does not work at the moment.

.. autoclass:: transformers.PegasusTokenizer
@@ -8,13 +8,24 @@ ProphetNet

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training,
<https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei
Zhang, Ming Zhou on 13 Jan, 2020.

ProphetNet is an encoder-decoder model and can predict n-future tokens for "ngram" language modeling instead of just
the next token.

The abstract from the paper is the following:

*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent
overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*

The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
RAG
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and
sequence-to-sequence models. RAG models retrieve documents, pass them to a seq2seq model, then marginalize to generate

@@ -15,46 +15,40 @@ Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäs

The abstract from the paper is the following:

*Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve
state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely
manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind
task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge
remain open research problems. Pre-trained models with a differentiable access mechanism to explicit nonparametric
memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a
general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained
parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a
pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a
pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages
across the whole generated sequence, the other can use different passages per token. We fine-tune and evaluate our
models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks,
outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation
tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art
parametric-only seq2seq baseline.*

RagConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RagConfig
    :members:

RagTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RagTokenizer
    :members: prepare_seq2seq_batch

Rag specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.modeling_rag.RetrievAugLMMarginOutput
    :members:

@@ -63,28 +57,28 @@ Rag specific outputs

    :members:

RagRetriever
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RagRetriever
    :members:

RagModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RagModel
    :members: forward

RagSequenceForGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RagSequenceForGeneration
    :members: forward, generate

RagTokenForGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RagTokenForGeneration
    :members: forward, generate
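A rough end-to-end sketch with :class:`~transformers.RagTokenForGeneration` (assuming the `facebook/rag-token-nq`
checkpoint; :obj:`use_dummy_dataset=True` keeps the example from downloading the full Wikipedia index):

.. code-block::

    from transformers import RagRetriever, RagTokenForGeneration, RagTokenizer

    tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
    retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True)
    model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

    # retrieve supporting documents and generate an answer conditioned on them
    input_dict = tokenizer.prepare_seq2seq_batch("who holds the record in 100m freestyle", return_tensors="pt")
    generated = model.generate(input_ids=input_dict["input_ids"])
    print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])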
@@ -10,7 +10,7 @@ Overview

The Reformer model was proposed in the paper `Reformer: The Efficient Transformer
<https://arxiv.org/abs/2001.04451.pdf>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.

The abstract from the paper is the following:

*Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can
be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of

@@ -36,12 +36,12 @@ would result in a position encoding matrix:

.. math::

    X_{i,j}, \text{ with } i \in \left[1,\ldots, d\right] \text{ and } j \in \left[1,\ldots, n_s\right]

which alone has over 500M parameters to store. Axial positional encodings factorize :math:`X_{i,j}` into two matrices:

.. math::

    X^{1}_{i,j}, \text{ with } i \in \left[1,\ldots, d^1\right] \text{ and } j \in \left[1,\ldots, n_s^1\right]

and

.. math::

    X^{2}_{i,j}, \text{ with } i \in \left[1,\ldots, d^2\right] \text{ and } j \in \left[1,\ldots, n_s^2\right]

@@ -67,22 +67,23 @@ factorized embedding vectors: :math:`x^1_{k, l} + x^2_{l, k}`, where as the :obj

Using the above example again, axial position encoding with :math:`d^1 = 2^5, d^2 = 2^5, n_s^1 = 2^9, n_s^2 = 2^{10}`
can drastically reduce the number of parameters to :math:`2^{14} + 2^{15} \approx 49000` parameters.

In practice, the parameter :obj:`config.axial_pos_embds_dim` is set to a tuple :math:`(d^1, d^2)` whose sum has to be
equal to :obj:`config.hidden_size` and :obj:`config.axial_pos_shape` is set to a tuple :math:`(n_s^1, n_s^2)` whose
product has to be equal to :obj:`config.max_embedding_size`, which during training has to be equal to the `sequence
length` of the :obj:`input_ids`.
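As a sketch, the example values above (:math:`d^1 = d^2 = 2^5`, :math:`n_s^1 = 2^9`, :math:`n_s^2 = 2^{10}`) would
translate into a configuration roughly like the following; the numbers are illustrative, not recommended settings:

.. code-block::

    from transformers import ReformerConfig, ReformerModelWithLMHead

    config = ReformerConfig(
        hidden_size=64,                # must equal the sum of axial_pos_embds_dim (d^1 + d^2)
        axial_pos_embds=True,
        axial_pos_embds_dim=(32, 32),  # (d^1, d^2)
        axial_pos_shape=(512, 1024),   # (n_s^1, n_s^2); the product is the training sequence length
    )
    model = ReformerModelWithLMHead(config)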
LSH Self Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In Locality sensitive hashing (LSH) self attention the key and query projection weights are tied. Therefore, the key
query embedding vectors are also tied. LSH self attention uses the locality sensitive hashing mechanism proposed in
`Practical and Optimal LSH for Angular Distance <https://arxiv.org/abs/1509.02897>`__ to assign each of the tied key
query embedding vectors to one of :obj:`config.num_buckets` possible buckets. The premise is that the more "similar"
key query embedding vectors (in terms of *cosine similarity*) are to each other, the more likely they are assigned to
the same bucket.

The accuracy of the LSH mechanism can be improved by increasing :obj:`config.num_hashes` or directly the argument
:obj:`num_hashes` of the forward function so that the output of the LSH self attention better approximates the output
of the "normal" full self attention. The buckets are then sorted and chunked into query key embedding vector chunks
each of length :obj:`config.lsh_chunk_length`. For each chunk, the query embedding vectors attend to its key vectors

@@ -92,11 +93,11 @@ neighboring chunks and :obj:`config.lsh_num_chunks_after` following neighboring

For more information, see the `original Paper <https://arxiv.org/abs/2001.04451>`__ or this great `blog post
<https://www.pragmatic.ml/reformer-deep-dive/>`__.

Note that :obj:`config.num_buckets` can also be factorized into a list :math:`(n_{\text{buckets}}^1,
n_{\text{buckets}}^2)`. This way instead of assigning the query key embedding vectors to one of :math:`(1,\ldots,
n_{\text{buckets}})` they are assigned to one of :math:`(1-1,\ldots, n_{\text{buckets}}^1-1, \ldots,
1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)`. This is crucial for very long sequences to
save memory.

When training a model from scratch, it is recommended to leave :obj:`config.num_buckets=None`, so that depending on the
sequence length a good value for :obj:`num_buckets` is calculated on the fly. This value will then automatically be

@@ -128,7 +129,7 @@ multiple of :obj:`config.lsh_chunk_length` and :obj:`config.local_chunk_length`

Positional Encodings are correctly set as described above. Reformer is very memory efficient so that the model can
easily be trained on sequences as long as 64000 tokens.

For training, the :class:`~transformers.ReformerModelWithLMHead` should be used as follows:

.. code-block::
@@ -8,8 +8,8 @@ The RoBERTa model was proposed in `RoBERTa: A Robustly Optimized BERT Pretrainin

<https://arxiv.org/abs/1907.11692>`_ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google's BERT model released in 2018.

It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with
much larger mini-batches and learning rates.

The abstract from the paper is the following:

@@ -17,15 +17,15 @@ The abstract from the paper is the following:

approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes,
and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication
study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and
training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every
model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results
highlight the importance of previously overlooked design choices, and raise questions about the source of recently
reported improvements. We release our models and code.*

Tips:

- This implementation is the same as :class:`~transformers.BertModel` with a tiny embeddings tweak as well as a setup
  for Roberta pretrained models.
- RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a
  different pretraining scheme.
- RoBERTa doesn't have :obj:`token_type_ids`, you don't need to indicate which token belongs to which segment. Just
@@ -4,38 +4,34 @@ SqueezeBERT

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The SqueezeBERT model was proposed in `SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
<https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer. It's a
bidirectional transformer similar to the BERT model. The key difference between the BERT architecture and the
SqueezeBERT architecture is that SqueezeBERT uses `grouped convolutions <https://blog.yani.io/filter-group-tutorial>`__
instead of fully-connected layers for the Q, K, V and FFN layers.

The abstract from the paper is the following:

*Humans read and write hundreds of billions of messages every day. Further, due to the availability of large datasets,
large computing systems, and better neural network models, natural language processing (NLP) technology has made
significant strides in understanding, proofreading, and organizing these messages. Thus, there is a significant
opportunity to deploy NLP in myriad applications to help web users, social networks, and businesses. In particular, we
consider smartphones and other mobile devices as crucial platforms for deploying NLP models at scale. However, today's
highly-accurate NLP neural network models such as BERT and RoBERTa are extremely computationally expensive, with
BERT-base taking 1.7 seconds to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods
such as grouped convolutions have yielded significant speedups for computer vision networks, but many of these
techniques have not been adopted by NLP neural network designers. We demonstrate how to replace several operations in
self-attention layers with grouped convolutions, and we use this technique in a novel network architecture called
SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive accuracy on the GLUE test
set. The SqueezeBERT code will be released.*

Tips:

- SqueezeBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right
  rather than the left.
- SqueezeBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore
  efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained
  with a causal language modeling (CLM) objective are better in that regard.
- For best results when finetuning on sequence classification tasks, it is recommended to start with the
  `squeezebert/squeezebert-mnli-headless` checkpoint (see the sketch after this list).
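A possible starting point for such a fine-tuning run, sketched under the assumption that a fresh two-way classification
head is wanted:

.. code-block::

    import torch
    from transformers import SqueezeBertForSequenceClassification, SqueezeBertTokenizer

    tokenizer = SqueezeBertTokenizer.from_pretrained("squeezebert/squeezebert-mnli-headless")
    model = SqueezeBertForSequenceClassification.from_pretrained(
        "squeezebert/squeezebert-mnli-headless", num_labels=2  # the classification head is newly initialized
    )

    inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
    outputs = model(**inputs, labels=torch.tensor([1]))  # loss and logits for one labeled example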
@@ -29,13 +29,12 @@ Tips:

  each task is converted into a text-to-text format. T5 works well on a variety of tasks out-of-the-box by prepending a
  different prefix to the input corresponding to each task, e.g., for translation: *translate English to German: ...*,
  for summarization: *summarize: ...*.

  For more information about which prefix to use, it is easiest to look into Appendix D of the `paper
  <https://arxiv.org/pdf/1910.10683.pdf>`__.

- For sequence-to-sequence generation, it is recommended to use :obj:`T5ForConditionalGeneration.generate()`, as in the
  sketch after this list. This method takes care of feeding the encoded input via cross-attention layers to the decoder
  and auto-regressively generates the decoder output.
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
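A minimal generation sketch (the `t5-small` checkpoint is an arbitrary choice; any T5 checkpoint with any prefix from
Appendix D works the same way):

.. code-block::

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # prepend the task prefix, then let generate() run the decoder auto-regressively
    input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))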
The original code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`__.

@@ -51,14 +50,14 @@ token. T5 can be trained / fine-tuned both in a supervised and unsupervised fash

- Unsupervised denoising training

  In this setup spans of the input sequence are masked by so-called sentinel tokens (*a.k.a* unique mask tokens) and
  the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens. Each
  sentinel token represents a unique mask token for this sentence and should start with :obj:`<extra_id_0>`,
  :obj:`<extra_id_1>`, ... up to :obj:`<extra_id_99>`. As a default, 100 sentinel tokens are available in
  :class:`~transformers.T5Tokenizer`.

  For instance, the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be
  processed as follows:

  .. code-block::

@@ -69,10 +68,10 @@ token. T5 can be trained / fine-tuned both in a supervised and unsupervised fash

- Supervised training

  In this setup the input sequence and output sequence are standard sequence-to-sequence input output mapping. In
  translation, for instance with the input sequence "The house is wonderful." and output sequence "Das Haus ist
  wunderbar.", the sentences should be processed as follows:

  .. code-block::

      input_ids = tokenizer('translate English to German: The house is wonderful.', return_tensors='pt').input_ids
@@ -14,19 +14,19 @@ The abstract from the paper is the following:

*Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the
setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency
beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a
novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the
context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450%
longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+
times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of
bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn
Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably
coherent, novel text articles with thousands of tokens.*

Tips:

- Transformer-XL uses relative sinusoidal positional embeddings. Padding can be done on the left or on the right. The
  original implementation trains on SQuAD with padding on the left, therefore the padding defaults are set to left.
- Transformer-XL is one of the few models that has no sequence length limit.

The original code can be found `here <https://github.com/kimiyoung/transformer-xl>`__.
@@ -14,21 +14,21 @@ Guillaume Lample, Alexis Conneau. It's a transformer pretrained using one of the

The abstract from the paper is the following:

*Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding.
In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We
propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual
data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain
state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI, our
approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, we
obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised
machine translation, we obtain a new state of the art of 38.5 BLEU on WMT'16 Romanian-English, outperforming the
previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.*

Tips:

- XLM has many different checkpoints, which were trained using different objectives: CLM, MLM or TLM. Make sure to
  select the correct objective for your task (e.g. MLM checkpoints are not suitable for generation).
- XLM has multilingual checkpoints which leverage a specific :obj:`lang` parameter. Check out the :doc:`multi-lingual
  <../multilingual>` page for more information; a short sketch follows this list.
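A short sketch of passing language ids to a multilingual checkpoint (the `xlm-clm-enfr-1024` checkpoint is assumed; the
forward argument is named :obj:`langs` and the mapping comes from the tokenizer's :obj:`lang2id`):

.. code-block::

    import torch
    from transformers import XLMTokenizer, XLMWithLMHeadModel

    tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
    model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")

    input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])
    # one language id per token, taken from the tokenizer's lang2id mapping
    langs = torch.full_like(input_ids, tokenizer.lang2id["en"])
    outputs = model(input_ids, langs=langs)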
The original code can be found `here <https://github.com/facebookresearch/XLM/>`__.
@@ -9,13 +9,25 @@ XLM-ProphetNet

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The XLM-ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training,
<https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei
Zhang, Ming Zhou on 13 Jan, 2020.

XLM-ProphetNet is an encoder-decoder model and can predict n-future tokens for "ngram" language modeling instead of
just the next token. Its architecture is identical to ProphetNet, but the model was trained on the multi-lingual
"wiki100" Wikipedia dump.

The abstract from the paper is the following:

*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent
overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*

The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
......
...@@ -12,25 +12,25 @@ data. ...@@ -12,25 +12,25 @@ data.
The abstract from the paper is the following: The abstract from the paper is the following:
*This paper shows that pretraining multilingual language models at scale leads to significant performance gains for *This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a
a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred
languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly
outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on
on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on
low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We
We also present a detailed empirical evaluation of the key factors that are required to achieve these gains, also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the
including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource
low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing
without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We
and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.* will make XLM-R code, data, and models publicly available.*
Tips: Tips:
- XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does - XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does
not require :obj:`lang` tensors to understand which language is used, and should be able to determine the correct not require :obj:`lang` tensors to understand which language is used, and should be able to determine the correct
language from the input ids, as in the sketch below. language from the input ids, as in the sketch below.
- This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage - This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage examples
examples as well as the information relative to the inputs and outputs. as well as the information relative to the inputs and outputs.
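For instance, here is a minimal sketch of the first tip above (the ``xlm-roberta-base`` checkpoint and the French
sentence are just illustrative choices; note that no ``lang`` tensor is passed):

.. code-block::

    from transformers import XLMRobertaModel, XLMRobertaTokenizer

    tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
    model = XLMRobertaModel.from_pretrained("xlm-roberta-base")

    # No language tensor is needed: the model infers the language from the input ids.
    inputs = tokenizer("Bonjour, comment allez-vous ?", return_tensors="pt")
    outputs = model(**inputs)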
The original code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/xlmr>`__. The original code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/xlmr>`__.
......
...@@ -16,11 +16,11 @@ The abstract from the paper is the following: ...@@ -16,11 +16,11 @@ The abstract from the paper is the following:
better performance than pretraining approaches based on autoregressive language modeling. However, relying on better performance than pretraining approaches based on autoregressive language modeling. However, relying on
corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a
pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive
pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all
all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive
formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into
into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large
a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.* margin, including question answering, natural language inference, sentiment analysis, and document ranking.*
Tips: Tips:
......
...@@ -15,8 +15,8 @@ Prepare your model for uploading ...@@ -15,8 +15,8 @@ Prepare your model for uploading
We have seen in the :doc:`training tutorial <training>` how to fine-tune a model on a given task. You have probably We have seen in the :doc:`training tutorial <training>` how to fine-tune a model on a given task. You have probably
done something similar on your task, either using the model directly in your own training loop or using the done something similar on your task, either using the model directly in your own training loop or using the
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` class. Let's see how you can share the result on :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` class. Let's see how you can share the result on the
the `model hub <https://huggingface.co/models>`__. `model hub <https://huggingface.co/models>`__.
Basic steps Basic steps
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...@@ -60,22 +60,20 @@ Make your model work on all frameworks ...@@ -60,22 +60,20 @@ Make your model work on all frameworks
You probably have your favorite framework, but so will other users! That's why it's best to upload your model with both You probably have your favorite framework, but so will other users! That's why it's best to upload your model with both
PyTorch `and` TensorFlow checkpoints to make it easier to use (if you skip this step, users will still be able to load PyTorch `and` TensorFlow checkpoints to make it easier to use (if you skip this step, users will still be able to load
your model in another framework, but it will be slower, as it will have to be converted on the fly). Don't worry, it's super easy to do (and in a future version, your model in another framework, but it will be slower, as it will have to be converted on the fly). Don't worry, it's
it will all be automatic). You will need to install both PyTorch and TensorFlow for this step, but you don't need to super easy to do (and in a future version, it will all be automatic). You will need to install both PyTorch and
worry about the GPU, so it should be very easy. Check the TensorFlow for this step, but you don't need to worry about the GPU, so it should be very easy. Check the `TensorFlow
`TensorFlow installation page <https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available>`__ installation page <https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available>`__ and/or the `PyTorch
and/or the `PyTorch installation page <https://pytorch.org/get-started/locally/#start-locally>`__ to see how. installation page <https://pytorch.org/get-started/locally/#start-locally>`__ to see how.
First check that your model class exists in the other framework, that is, try to import the same model by either adding First check that your model class exists in the other framework, that is, try to import the same model by either adding
or removing TF. For instance, if you trained a :class:`~transformers.DistilBertForSequenceClassification`, try to or removing TF. For instance, if you trained a :class:`~transformers.DistilBertForSequenceClassification`, try to type
type
.. code-block:: .. code-block::
from transformers import TFDistilBertForSequenceClassification from transformers import TFDistilBertForSequenceClassification
and if you trained a :class:`~transformers.TFDistilBertForSequenceClassification`, try to and if you trained a :class:`~transformers.TFDistilBertForSequenceClassification`, try to type
type
.. code-block:: .. code-block::
...@@ -112,7 +110,8 @@ Make sure there are no garbage files in the directory you'll upload. It should o ...@@ -112,7 +110,8 @@ Make sure there are no garbage files in the directory you'll upload. It should o
- a `tf_model.h5` file, which is the TensorFlow checkpoint (unless you can't have it for some reason); - a `tf_model.h5` file, which is the TensorFlow checkpoint (unless you can't have it for some reason);
- a `special_tokens_map.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save; - a `special_tokens_map.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
- a `tokenizer_config.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save; - a `tokenizer_config.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
- files named `vocab.json`, `vocab.txt`, `merges.txt`, or similar, which contain the vocabulary of your tokenizer, part of your :doc:`tokenizer <main_classes/tokenizer>` save; - files named `vocab.json`, `vocab.txt`, `merges.txt`, or similar, which contain the vocabulary of your tokenizer, part
of your :doc:`tokenizer <main_classes/tokenizer>` save;
- maybe an `added_tokens.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save. - maybe an `added_tokens.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save.
Other files can safely be deleted. Other files can safely be deleted.
...@@ -135,7 +134,8 @@ Then log in using the same credentials as on huggingface.co. To upload your mode ...@@ -135,7 +134,8 @@ Then log in using the same credentials as on huggingface.co. To upload your mode
This will upload the folder containing the weights, tokenizer and configuration we prepared in the previous section. This will upload the folder containing the weights, tokenizer and configuration we prepared in the previous section.
By default you will be prompted to confirm that you want these files to be uploaded. If you are uploading multiple models and need to script that process, you can add `-y` to bypass the prompt. For example: By default you will be prompted to confirm that you want these files to be uploaded. If you are uploading multiple
models and need to script that process, you can add `-y` to bypass the prompt. For example:
.. code-block:: .. code-block::
...@@ -179,15 +179,15 @@ Add a model card ...@@ -179,15 +179,15 @@ Add a model card
To make sure everyone knows what your model can do, its limitations, and any potential bias or ethical To make sure everyone knows what your model can do, its limitations, and any potential bias or ethical
considerations, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should then be considerations, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should then be
placed in a subfolder with your username or organization, then another subfolder named like your model placed in a subfolder with your username or organization, then another subfolder named like your model
(`awesome-name-you-picked`). Or just click on the "Create a model card on GitHub" button on the model page; it will (`awesome-name-you-picked`). Or just click on the "Create a model card on GitHub" button on the model page; it will get
get you directly to the right location. If you need one, `here <https://github.com/huggingface/model_card>`__ is a you directly to the right location. If you need one, `here <https://github.com/huggingface/model_card>`__ is a model
model card template (meta-suggestions are welcome). card template (meta-suggestions are welcome).
If your model is fine-tuned from another model coming from the model hub (all 🤗 Transformers pretrained models do), If your model is fine-tuned from another model coming from the model hub (all 🤗 Transformers pretrained models do),
don't forget to link to its model card so that people can fully trace how your model was built. don't forget to link to its model card so that people can fully trace how your model was built.
If you have never made a pull request to the 🤗 Transformers repo, look at the If you have never made a pull request to the 🤗 Transformers repo, look at the :doc:`contributing guide <contributing>`
:doc:`contributing guide <contributing>` to see the steps to follow. to see the steps to follow.
.. Note:: .. Note::
......
Summary of the models Summary of the models
======================================================================================================================= =======================================================================================================================
This is a summary of the models available in 🤗 Transformers. It assumes you're familiar with the original This is a summary of the models available in 🤗 Transformers. It assumes you're familiar with the original `transformer
`transformer model <https://arxiv.org/abs/1706.03762>`_. For a gentle introduction check the `annotated transformer model <https://arxiv.org/abs/1706.03762>`_. For a gentle introduction check the `annotated transformer
<http://nlp.seas.harvard.edu/2018/04/03/attention.html>`_. Here we focus on the high-level differences between the <http://nlp.seas.harvard.edu/2018/04/03/attention.html>`_. Here we focus on the high-level differences between the
models. You can check them more in detail in their respective documentation. Also check out the models. You can check them more in detail in their respective documentation. Also check out the :doc:`pretrained model
:doc:`pretrained model page </pretrained_models>` to see the checkpoints available for each type of model and all `the page </pretrained_models>` to see the checkpoints available for each type of model and all `the community models
community models <https://huggingface.co/models>`_. <https://huggingface.co/models>`_.
Each one of the models in the library falls into one of the following categories: Each one of the models in the library falls into one of the following categories:
...@@ -19,8 +19,8 @@ Each one of the models in the library falls into one of the following categories ...@@ -19,8 +19,8 @@ Each one of the models in the library falls into one of the following categories
Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
sentence so that the attention heads can only see what was before in the text, and not what's after. Although those sentence so that the attention heads can only see what was before in the text, and not what's after. Although those
models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation. models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation. A
A typical example of such models is GPT. typical example of such models is GPT.
Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original
sentence. They correspond to the encoder of the original transformer model in the sense that they get access to the sentence. They correspond to the encoder of the original transformer model in the sense that they get access to the
...@@ -30,8 +30,8 @@ sentence classification or token classification. A typical example of such model ...@@ -30,8 +30,8 @@ sentence classification or token classification. A typical example of such model
Note that the only difference between autoregressive models and autoencoding models is in the way the model is Note that the only difference between autoregressive models and autoencoding models is in the way the model is
pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models. When a given pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models. When a given
model has been used for both types of pretraining, we have put it in the category corresponding to the article where it was first model has been used for both types of pretraining, we have put it in the category corresponding to the article where it
introduced. was first introduced.
Sequence-to-sequence models use both the encoder and the decoder of the original transformer, either for translation Sequence-to-sequence models use both the encoder and the decoder of the original transformer, either for translation
tasks or by transforming other tasks to sequence-to-sequence problems. They can be fine-tuned to many tasks but their tasks or by transforming other tasks to sequence-to-sequence problems. They can be fine-tuned to many tasks but their
...@@ -60,8 +60,8 @@ Original GPT ...@@ -60,8 +60,8 @@ Original GPT
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-openai--gpt-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-openai--gpt-blueviolet">
</a> </a>
`Improving Language Understanding by Generative Pre-Training <https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf>`_, `Improving Language Understanding by Generative Pre-Training
Alec Radford et al. <https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf>`_, Alec Radford et al.
The first autoregressive model based on the transformer architecture, pretrained on the Book Corpus dataset. The first autoregressive model based on the transformer architecture, pretrained on the Book Corpus dataset.
...@@ -80,7 +80,8 @@ GPT-2 ...@@ -80,7 +80,8 @@ GPT-2
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-gpt2-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-gpt2-blueviolet">
</a> </a>
`Language Models are Unsupervised Multitask Learners <https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_, `Language Models are Unsupervised Multitask Learners
<https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_,
Alec Radford et al. Alec Radford et al.
A bigger and better version of GPT, pretrained on WebText (web pages from outgoing links in Reddit with 3 karmas or A bigger and better version of GPT, pretrained on WebText (web pages from outgoing links in Reddit with 3 karmas or
...@@ -122,8 +123,8 @@ Transformer-XL ...@@ -122,8 +123,8 @@ Transformer-XL
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-transfo--xl-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-transfo--xl-blueviolet">
</a> </a>
`Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`_, `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`_, Zihang
Zihang Dai et al. Dai et al.
Same as a regular GPT model, but introduces a recurrence mechanism for two consecutive segments (similar to a regular Same as a regular GPT model, but introduces a recurrence mechanism for two consecutive segments (similar to a regular
RNN with two consecutive inputs). In this context, a segment is a number of consecutive tokens (for instance 512) that RNN with two consecutive inputs). In this context, a segment is a number of consecutive tokens (for instance 512) that
...@@ -153,8 +154,7 @@ Reformer ...@@ -153,8 +154,7 @@ Reformer
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-reformer-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-reformer-blueviolet">
</a> </a>
`Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_, `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_, Nikita Kitaev et al.
Nikita Kitaev et al.
An autoregressive transformer model with lots of tricks to reduce memory footprint and compute time. Those tricks An autoregressive transformer model with lots of tricks to reduce memory footprint and compute time. Those tricks
include: include:
...@@ -188,8 +188,8 @@ XLNet ...@@ -188,8 +188,8 @@ XLNet
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlnet-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlnet-blueviolet">
</a> </a>
`XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_, `XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_, Zhilin
Zhilin Yang et al. Yang et al.
XLNet is not a traditional autoregressive model but uses a training strategy that builds on that. It permutes the XLNet is not a traditional autoregressive model but uses a training strategy that builds on that. It permutes the
tokens in the sentence, then allows the model to use the last n tokens to predict the token n+1. Since this is all done tokens in the sentence, then allows the model to use the last n tokens to predict the token n+1. Since this is all done
...@@ -207,7 +207,8 @@ Autoencoding models ...@@ -207,7 +207,8 @@ Autoencoding models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can
look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their corrupted versions. look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their
corrupted versions.
BERT BERT
----------------------------------------------------------------------------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------
...@@ -260,8 +261,8 @@ Same as BERT but with a few tweaks: ...@@ -260,8 +261,8 @@ Same as BERT but with a few tweaks:
sequence of tokens) so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V sequence of tokens) so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V
being the vocab size). If E < H, it has fewer parameters. being the vocab size). If E < H, it has fewer parameters.
* Layers are split in groups that share parameters (to save memory). * Layers are split in groups that share parameters (to save memory).
* Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B * Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and
(that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have B (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have
been swapped or not. been swapped or not.
The library provides a version of the model for masked language modeling, token classification, sentence The library provides a version of the model for masked language modeling, token classification, sentence
...@@ -279,8 +280,7 @@ RoBERTa ...@@ -279,8 +280,7 @@ RoBERTa
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-roberta-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-roberta-blueviolet">
</a> </a>
`RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_, `RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_, Yinhan Liu et al.
Yinhan Liu et al.
Same as BERT with better pretraining tricks: Same as BERT with better pretraining tricks:
...@@ -339,8 +339,8 @@ library provides checkpoints for all of them: ...@@ -339,8 +339,8 @@ library provides checkpoints for all of them:
previous section as well). One of the languages is selected for each training sample, and the model input is a previous section as well). One of the languages is selected for each training sample, and the model input is a
sentence of 256 tokens, that may span over several documents in one of those languages. sentence of 256 tokens, that may span over several documents in one of those languages.
* Masked language modeling (MLM) which is like RoBERTa. One of the languages is selected for each training sample, * Masked language modeling (MLM) which is like RoBERTa. One of the languages is selected for each training sample,
and the model input is a sentence of 256 tokens, that may span over several documents in one of those languages, with and the model input is a sentence of 256 tokens, that may span over several documents in one of those languages,
dynamic masking of the tokens. with dynamic masking of the tokens.
* A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two * A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two
different languages, with random masking. To predict one of the masked tokens, the model can use both the different languages, with random masking. To predict one of the masked tokens, the model can use both the
surrounding context in language 1 and the context given by language 2. surrounding context in language 1 and the context given by language 2.
...@@ -523,20 +523,21 @@ Pegasus ...@@ -523,20 +523,21 @@ Pegasus
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-pegasus-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-pegasus-blueviolet">
</a> </a>
`PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization `PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
<https://arxiv.org/pdf/1912.08777.pdf>`_, Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019. <https://arxiv.org/pdf/1912.08777.pdf>`_, Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on
two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training
objective, called Gap Sentence Generation (GSG). objective, called Gap Sentence Generation (GSG).
* MLM: encoder input tokens are randomly replaced by a mask token and have to be predicted by the encoder (like * MLM: encoder input tokens are randomly replaced by a mask token and have to be predicted by the encoder (like in
in BERT) BERT)
* GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, which has a * GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, which has a
causal mask to hide the future words like a regular auto-regressive transformer decoder. causal mask to hide the future words like a regular auto-regressive transformer decoder.
In contrast to BART, Pegasus' pretraining task is intentionally similar to summarization: important sentences are In contrast to BART, Pegasus' pretraining task is intentionally similar to summarization: important sentences are
masked and are generated together as one output sequence from the remaining sentences, similar to an extractive summary. masked and are generated together as one output sequence from the remaining sentences, similar to an extractive
summary.
The library provides a version of this model for conditional generation, which should be used for summarization. The library provides a version of this model for conditional generation, which should be used for summarization.
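As a rough sketch of that summarization use (the ``google/pegasus-xsum`` checkpoint and the input text below are
illustrative assumptions, not a recommendation):

.. code-block::

    from transformers import PegasusForConditionalGeneration, PegasusTokenizer

    tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
    model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

    text = "PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions."
    batch = tokenizer(text, truncation=True, padding="longest", return_tensors="pt")

    # Generate a short abstractive summary of the input document.
    summary_ids = model.generate(**batch)
    print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])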
...@@ -571,20 +572,20 @@ T5 ...@@ -571,20 +572,20 @@ T5
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-t5-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-t5-blueviolet">
</a> </a>
`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`_, `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel et al. <https://arxiv.org/abs/1910.10683>`_, Colin Raffel et al.
Uses the traditional transformer model (with a slight change in the positional embeddings, which are learned at Uses the traditional transformer model (with a slight change in the positional embeddings, which are learned at each
each layer). To be able to operate on all NLP tasks, it transforms them into text-to-text problems by using specific layer). To be able to operate on all NLP tasks, it transforms them into text-to-text problems by using specific
prefixes: summarize: , question: , translate English to German: and so forth. prefixes: summarize: , question: , translate English to German: and so forth.
The pretraining includes both supervised and self-supervised training. Supervised training is conducted on downstream The pretraining includes both supervised and self-supervised training. Supervised training is conducted on downstream
tasks provided by the GLUE and SuperGLUE benchmarks (converting them into text-to-text tasks as explained above). tasks provided by the GLUE and SuperGLUE benchmarks (converting them into text-to-text tasks as explained above).
Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and replacing them with
replacing them with individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced with a
is replaced with a single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the original
is the original sentence and the target is then the dropped out tokens delimited by their sentinel tokens. sentence and the target is then the dropped out tokens delimited by their sentinel tokens.
For instance, if we have the sentence "My dog is very cute .", and we decide to remove the tokens: "dog", "is" and For instance, if we have the sentence "My dog is very cute .", and we decide to remove the tokens: "dog", "is" and
"cute", the encoder input becomes "My <x> very <y> ." and the target input becomes "<x> dog is <y> cute .<z>" "cute", the encoder input becomes "My <x> very <y> ." and the target input becomes "<x> dog is <y> cute .<z>"
...@@ -603,13 +604,12 @@ MBart ...@@ -603,13 +604,12 @@ MBart
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mbart-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mbart-blueviolet">
</a> </a>
`Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan `Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu,
Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
The model architecture and pre-training objective are the same as BART's, but MBart is trained on 25 languages The model architecture and pre-training objective are the same as BART's, but MBart is trained on 25 languages and is intended
and is intended for supervised and unsupervised machine translation. MBart is one of the first methods for supervised and unsupervised machine translation. MBart is one of the first methods for pre-training a complete
for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages. sequence-to-sequence model by denoising full texts in multiple languages.
The library provides a version of this model for conditional generation. The library provides a version of this model for conditional generation.
...@@ -636,11 +636,11 @@ ProphetNet ...@@ -636,11 +636,11 @@ ProphetNet
Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou. Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
ProphetNet introduces a novel *sequence-to-sequence* pre-training objective, called *future n-gram prediction*. In ProphetNet introduces a novel *sequence-to-sequence* pre-training objective, called *future n-gram prediction*. In
future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at each
each time step instead instead of just the single next token. The future n-gram prediction explicitly encourages time step instead instead of just the single next token. The future n-gram prediction explicitly encourages the model
the model to plan for the future tokens and prevent overfitting on strong local correlations. to plan for the future tokens and prevent overfitting on strong local correlations. The model architecture is based on
The model architecture is based on the original Transformer, but replaces the "standard" self-attention mechanism the original Transformer, but replaces the "standard" self-attention mechanism in the decoder by a a main
in the decoder by a a main self-attention mechanism and a self and n-stream (predict) self-attention mechanism. self-attention mechanism and a self and n-stream (predict) self-attention mechanism.
The library provides a pre-trained version of this model for conditional generation and a fine-tuned version for The library provides a pre-trained version of this model for conditional generation and a fine-tuned version for
summarization. summarization.
...@@ -682,8 +682,8 @@ et al. ...@@ -682,8 +682,8 @@ et al.
A transformers model used in multimodal settings, combining a text and an image to make predictions. The transformer A transformers model used in multimodal settings, combining a text and an image to make predictions. The transformer
model takes as inputs the embeddings of the tokenized text and the final activations of a resnet pretrained on images model takes as inputs the embeddings of the tokenized text and the final activations of a resnet pretrained on images
(after the pooling layer) that goes through a linear layer (to go from number of features at the end of the (after the pooling layer) that goes through a linear layer (to go from number of features at the end of the resnet to
resnet to the hidden state dimension of the transformer). the hidden state dimension of the transformer).
The different inputs are concatenated, and on top of the positional embeddings, a segment embedding is added to let the The different inputs are concatenated, and on top of the positional embeddings, a segment embedding is added to let the
model know which part of the input vector corresponds to the text and which to the image. model know which part of the input vector corresponds to the text and which to the image.
...@@ -691,8 +691,7 @@ model know which part of the input vector corresponds to the text and which to t ...@@ -691,8 +691,7 @@ model know which part of the input vector corresponds to the text and which to t
The pretrained model only works for classification. The pretrained model only works for classification.
.. ..
More information in this :doc:`model documentation </model_doc/mmbt.html>`. More information in this :doc:`model documentation </model_doc/mmbt.html>`. TODO: write this page
TODO: write this page
.. _retrieval-based-models: .. _retrieval-based-models:
...@@ -714,19 +713,22 @@ DPR ...@@ -714,19 +713,22 @@ DPR
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-dpr-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-dpr-blueviolet">
</a> </a>
`Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`_, `Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`_, Vladimir Karpukhin et
Vladimir Karpukhin et al. al.
Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain question-answering research. Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain question-answering
research.
DPR consists of three models: DPR consists of three models:
* Question encoder: encode questions as vectors * Question encoder: encode questions as vectors
* Context encoder: encode contexts as vectors * Context encoder: encode contexts as vectors
* Reader: extract the answer to the question from the retrieved contexts, along with a relevance score (high if the inferred span actually answers the question). * Reader: extract the answer to the question from the retrieved contexts, along with a relevance score (high if the
inferred span actually answers the question).
DPR's pipeline (not implemented yet) uses a retrieval step to find the top k contexts given a certain question, and then it calls the reader with the question and the retrieved documents to get the answer. DPR's pipeline (not implemented yet) uses a retrieval step to find the top k contexts given a certain question, and
then it calls the reader with the question and the retrieved documents to get the answer.
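For example, a minimal sketch of the question encoder (the ``facebook/dpr-question_encoder-single-nq-base`` checkpoint
is one of the released weights; the context encoder and reader have analogous classes):

.. code-block::

    from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

    tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
    model = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

    inputs = tokenizer("What is the capital of France?", return_tensors="pt")
    # The first output is the dense vector used to search the context index.
    question_embedding = model(**inputs)[0]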
RAG RAG
----------------------------------------------------------------------------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------
...@@ -740,12 +742,14 @@ RAG ...@@ -740,12 +742,14 @@ RAG
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-rag-blueviolet"> <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-rag-blueviolet">
</a> </a>
`Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks <https://arxiv.org/abs/2005.11401>`_, `Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks <https://arxiv.org/abs/2005.11401>`_, Patrick Lewis,
Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau
Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela
Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq models. Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq
RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs. models. RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs. The retriever and
The retriever and seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks. seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation
to adapt to downstream tasks.
The two models RAG-Token and RAG-Sequence are available for generation. The two models RAG-Token and RAG-Sequence are available for generation.
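A rough sketch of RAG-Token generation (this assumes the ``facebook/rag-token-nq`` checkpoint and the small dummy
retrieval index, which may require the ``datasets`` and ``faiss`` packages to be installed):

.. code-block::

    from transformers import RagRetriever, RagTokenForGeneration, RagTokenizer

    tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
    # use_dummy_dataset avoids downloading the full Wikipedia index for this sketch.
    retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True)
    model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

    input_dict = tokenizer.prepare_seq2seq_batch("who holds the record in 100m freestyle?", return_tensors="pt")
    generated = model.generate(input_ids=input_dict["input_ids"])
    print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])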
...@@ -764,19 +768,19 @@ use a sparse version of the attention matrix to speed up training. ...@@ -764,19 +768,19 @@ use a sparse version of the attention matrix to speed up training.
**LSH attention** **LSH attention**
:ref:`Reformer <reformer>` uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax :ref:`Reformer <reformer>` uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax
dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only
the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is
modified to mask the current token (except at the first position), because the query and the key would be equal (hence very modified to mask the current token (except at the first position), because the query and the key would be equal (hence
similar to each other). Since the hash can be a bit random, several hash functions are used in practice (determined by very similar to each other). Since the hash can be a bit random, several hash functions are used in practice
an n_rounds parameter) and then are averaged together. (determined by an n_rounds parameter) and then are averaged together.
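To make the bucketing idea concrete, here is a toy illustration of one hash round (a simplification, not the library's
actual Reformer implementation; all sizes are made up):

.. code-block::

    import torch

    torch.manual_seed(0)
    d, n_buckets = 64, 8
    qk = torch.randn(10, d)                        # 10 shared query/key vectors
    rotation = torch.randn(d, n_buckets // 2)      # one random rotation = one hash round

    # Angular LSH: project, then take the argmax over the rotated and negated projections.
    buckets = torch.argmax(torch.cat([qk @ rotation, -qk @ rotation], dim=-1), dim=-1)
    # Tokens that land in the same bucket are the only ones that attend to each other;
    # several such hash rounds are computed and averaged in practice.
    print(buckets)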
.. _local-attention: .. _local-attention:
**Local attention** **Local attention**
:ref:`Longformer <longformer>` uses local attention: often, the local context (e.g., what are the two tokens to the left and :ref:`Longformer <longformer>` uses local attention: often, the local context (e.g., what are the two tokens to the
right?) is enough to take action for a given token. Also, by stacking attention layers that have a small window, the left and right?) is enough to take action for a given token. Also, by stacking attention layers that have a small
last layer will have a receptive field of more than just the tokens in the window, allowing them to build a window, the last layer will have a receptive field of more than just the tokens in the window, allowing them to build a
representation of the whole sentence. representation of the whole sentence.
Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access
...@@ -799,8 +803,9 @@ Other tricks ...@@ -799,8 +803,9 @@ Other tricks
:ref:`Reformer <reformer>` uses axial positional encodings: in traditional transformer models, the positional encoding :ref:`Reformer <reformer>` uses axial positional encodings: in traditional transformer models, the positional encoding
E is a matrix of size :math:`l` by :math:`d`, :math:`l` being the sequence length and :math:`d` the dimension of the E is a matrix of size :math:`l` by :math:`d`, :math:`l` being the sequence length and :math:`d` the dimension of the
hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate that, axial positional encodings consist of factorizing that big matrix E in two smaller matrices E1 and hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate
E2, with dimensions :math:`l_{1} \times d_{1}` and :math:`l_{2} \times d_{2}`, such that :math:`l_{1} \times l_{2} = l` that, axial positional encodings consist of factorizing that big matrix E in two smaller matrices E1 and E2, with
and :math:`d_{1} + d_{2} = d` (with the product for the lengths, this ends up being way smaller). The embedding for dimensions :math:`l_{1} \times d_{1}` and :math:`l_{2} \times d_{2}`, such that :math:`l_{1} \times l_{2} = l` and
time step :math:`j` in E is obtained by concatenating the embeddings for timestep :math:`j \% l1` in E1 and :math:`d_{1} + d_{2} = d` (with the product for the lengths, this ends up being way smaller). The embedding for time
:math:`j // l1` in E2. step :math:`j` in E is obtained by concatenating the embeddings for timestep :math:`j \% l1` in E1 and :math:`j // l1`
in E2.
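A small numerical sketch of that factorization (the sizes below are made-up values chosen so that
:math:`l_{1} \times l_{2} = l` and :math:`d_{1} + d_{2} = d`):

.. code-block::

    import torch

    l1, l2, d1, d2 = 32, 16, 64, 192       # so l = 512 positions and d = 256 dimensions
    E1 = torch.randn(l1, d1)
    E2 = torch.randn(l2, d2)

    j = 137                                 # some time step in the full sequence
    # The embedding for position j concatenates row j % l1 of E1 and row j // l1 of E2.
    e_j = torch.cat([E1[j % l1], E2[j // l1]])
    print(e_j.shape)                        # torch.Size([256])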
Multi-lingual models Multi-lingual models
======================================================================================================================= =======================================================================================================================
Most of the models available in this library are mono-lingual models (English, Chinese and German). A few Most of the models available in this library are mono-lingual models (English, Chinese and German). A few multi-lingual
multi-lingual models are available and have different mechanisms than mono-lingual models. models are available and have different mechanisms than mono-lingual models. This page details the usage of these
This page details the usage of these models. models.
The two models that currently support multiple languages are BERT and XLM. The two models that currently support multiple languages are BERT and XLM.
...@@ -28,8 +28,8 @@ This section concerns the following checkpoints: ...@@ -28,8 +28,8 @@ This section concerns the following checkpoints:
These checkpoints require language embeddings that will specify the language used at inference time. These language These checkpoints require language embeddings that will specify the language used at inference time. These language
embeddings are represented as a tensor that is of the same shape as the input ids passed to the model. The values in embeddings are represented as a tensor that is of the same shape as the input ids passed to the model. The values in
these tensors depend on the language used and are identifiable using the ``lang2id`` and ``id2lang`` attributes these tensors depend on the language used and are identifiable using the ``lang2id`` and ``id2lang`` attributes from
from the tokenizer. the tokenizer.
Here is an example using the ``xlm-clm-enfr-1024`` checkpoint (Causal language modeling, English-French): Here is an example using the ``xlm-clm-enfr-1024`` checkpoint (Causal language modeling, English-French):
...@@ -78,8 +78,9 @@ You can then feed it all as input to your model: ...@@ -78,8 +78,9 @@ You can then feed it all as input to your model:
>>> outputs = model(input_ids, langs=langs) >>> outputs = model(input_ids, langs=langs)
The example `run_generation.py <https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py>`__ The example `run_generation.py
can generate text using the CLM checkpoints from XLM, using the language embeddings. <https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py>`__ can generate
text using the CLM checkpoints from XLM, using the language embeddings.
XLM without Language Embeddings XLM without Language Embeddings
----------------------------------------------------------------------------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------
...@@ -89,8 +90,8 @@ This section concerns the following checkpoints: ...@@ -89,8 +90,8 @@ This section concerns the following checkpoints:
- ``xlm-mlm-17-1280`` (Masked language modeling, 17 languages) - ``xlm-mlm-17-1280`` (Masked language modeling, 17 languages)
- ``xlm-mlm-100-1280`` (Masked language modeling, 100 languages) - ``xlm-mlm-100-1280`` (Masked language modeling, 100 languages)
These checkpoints do not require language embeddings at inference time. These models produce generic These checkpoints do not require language embeddings at inference time. These models produce generic sentence
sentence representations, unlike the previously-mentioned XLM checkpoints. representations, unlike the previously-mentioned XLM checkpoints.
BERT BERT
...@@ -101,15 +102,15 @@ BERT has two checkpoints that can be used for multi-lingual tasks: ...@@ -101,15 +102,15 @@ BERT has two checkpoints that can be used for multi-lingual tasks:
- ``bert-base-multilingual-uncased`` (Masked language modeling + Next sentence prediction, 102 languages) - ``bert-base-multilingual-uncased`` (Masked language modeling + Next sentence prediction, 102 languages)
- ``bert-base-multilingual-cased`` (Masked language modeling + Next sentence prediction, 104 languages) - ``bert-base-multilingual-cased`` (Masked language modeling + Next sentence prediction, 104 languages)
These checkpoints do not require language embeddings at inference time. They should identify the language These checkpoints do not require language embeddings at inference time. They should identify the language used in the
used in the context and infer accordingly. context and infer accordingly.
XLM-RoBERTa XLM-RoBERTa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong gains
gains over previously released multi-lingual models like mBERT or XLM on downstream tasks like classification, over previously released multi-lingual models like mBERT or XLM on downstream tasks like classification, sequence
sequence labeling and question answering. labeling and question answering.
Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks: Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks:
......
Perplexity of fixed-length models Perplexity of fixed-length models
======================================================================================================================= =======================================================================================================================
Perplexity (PPL) is one of the most common metrics for evaluating language Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note
models. Before diving in, we should note that the metric applies specifically that the metric applies specifically to classical language models (sometimes called autoregressive or causal language
to classical language models (sometimes called autoregressive or causal models) and is not well defined for masked language models like BERT (see :doc:`summary of the models
language models) and is not well defined for masked language models like BERT <model_summary>`).
(see :doc:`summary of the models <model_summary>`).
Perplexity is defined as the exponentiated average log-likelihood of a Perplexity is defined as the exponentiated average log-likelihood of a sequence. If we have a tokenized sequence
sequence. If we have a tokenized sequence :math:`X = (x_0, x_1, \dots, x_t)`, :math:`X = (x_0, x_1, \dots, x_t)`, then the perplexity of :math:`X` is,
then the perplexity of :math:`X` is,
.. math:: .. math::
\text{PPL}(X) \text{PPL}(X)
= \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i}) } \right\} = \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i}) } \right\}
where :math:`\log p_\theta (x_i|x_{<i})` is the log-likelihood of the ith where :math:`\log p_\theta (x_i|x_{<i})` is the log-likelihood of the ith token conditioned on the preceding tokens
token conditioned on the preceding tokens :math:`x_{<i}` according to our :math:`x_{<i}` according to our model. Intuitively, it can be thought of as an evaluation of the model's ability to
model. Intuitively, it can be thought of as an evaluation of the model's predict uniformly among the set of specified tokens in a corpus. Importantly, this means that the tokenization
ability to predict uniformly among the set of specified tokens in a corpus. procedure has a direct impact on a model's perplexity which should always be taken into consideration when comparing
Importantly, this means that the tokenization procedure has a direct impact different models.
on a model's perplexity which should always be taken into consideration when
comparing different models.
This is also equivalent to the exponentiation of the cross-entropy between This is also equivalent to the exponentiation of the cross-entropy between the data and model predictions. For more
the data and model predictions. For more intuition about perplexity and its intuition about perplexity and its relationship to Bits Per Character (BPC) and data compression, check out this
relationship to Bits Per Character (BPC) and data compression, check out this `fantastic blog post on The Gradient <https://thegradient.pub/understanding-evaluation-metrics-for-language-models/>`_.
`fantastic blog post on The Gradient
<https://thegradient.pub/understanding-evaluation-metrics-for-language-models/>`_.
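In code, this equivalence means the perplexity of a sequence can be read off the cross-entropy loss that causal
language models in the library return; a minimal sketch with GPT-2 (the checkpoint and sentence are arbitrary):

.. code-block::

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    model = GPT2LMHeadModel.from_pretrained("gpt2")
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    input_ids = tokenizer("Perplexity is the exponentiated cross-entropy.", return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the first output is the average negative log-likelihood.
        neg_log_likelihood = model(input_ids, labels=input_ids)[0]
    ppl = torch.exp(neg_log_likelihood)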
Calculating PPL with fixed-length models Calculating PPL with fixed-length models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If we weren't limited by a model's context size, we would evaluate the If we weren't limited by a model's context size, we would evaluate the model's perplexity by autoregressively
model's perplexity by autoregressively factorizing a sequence and factorizing a sequence and conditioning on the entire preceding subsequence at each step, as shown below.
conditioning on the entire preceding subsequence at each step, as shown
below.
.. image:: imgs/ppl_full.gif
   :width: 600
   :alt: Full decomposition of a sequence with unlimited context length
When working with approximate models, however, we typically have a constraint on the number of tokens the model can
process. The largest version of :doc:`GPT-2 <model_doc/gpt2>`, for example, has a fixed length of 1024 tokens, so we
cannot calculate :math:`p_\theta(x_t|x_{<t})` directly when :math:`t` is greater than 1024.
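If you are unsure of a checkpoint's maximum input size, one way to check is to read it from the model configuration
(shown here for the base ``gpt2`` checkpoint, purely as an illustration):

.. code-block:: python

    from transformers import GPT2Config

    # n_positions is the number of positional embeddings, i.e. the maximum context length
    config = GPT2Config.from_pretrained("gpt2")
    print(config.n_positions)  # 1024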
Instead, the sequence is typically broken into subsequences equal to the model's maximum input size. If a model's max
input size is :math:`k`, we then approximate the likelihood of a token :math:`x_t` by conditioning only on the
:math:`k-1` tokens that precede it rather than the entire context. When evaluating the model's perplexity of a
sequence, a tempting but suboptimal approach is to break the sequence into disjoint chunks and add up the decomposed
log-likelihoods of each segment independently.
.. image:: imgs/ppl_chunked.gif
   :width: 600
   :alt: Suboptimal PPL not taking advantage of full available context
This is quick to compute since the perplexity of each segment can be computed in one forward pass, but serves as a poor
approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will
have less context at most of the prediction steps.
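A minimal sketch of this chunked evaluation, assuming a causal language model ``model``, a tokenized corpus
``encodings`` and a ``device`` like the ones set up in the GPT-2 example below, could look like this:

.. code-block:: python

    import torch

    max_length = model.config.n_positions  # e.g. 1024 for GPT-2

    nlls, n_tokens = [], 0
    for begin_loc in range(0, encodings.input_ids.size(1), max_length):
        end_loc = min(begin_loc + max_length, encodings.input_ids.size(1))
        input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)

        with torch.no_grad():
            # the returned loss is the average negative log-likelihood over this chunk,
            # so rescale it back to a (roughly) summed negative log-likelihood
            neg_log_likelihood = model(input_ids, labels=input_ids)[0] * (end_loc - begin_loc)

        nlls.append(neg_log_likelihood)
        n_tokens += end_loc - begin_loc

    ppl = torch.exp(torch.stack(nlls).sum() / n_tokens)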
Instead, the PPL of fixed-length models should be evaluated with a sliding-window strategy. This involves repeatedly
sliding the context window so that the model has more context when making each prediction.
.. image:: imgs/ppl_sliding.gif
   :width: 600
   :alt: Sliding window PPL taking advantage of all available context
This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
1 token at a time. This allows computation to proceed much faster while still giving the model a large context to make
predictions at each step.
Example: Calculating perplexity with GPT-2 in 🤗 Transformers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Let's demonstrate this process with GPT-2.

.. code-block:: python

    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    # model_id (e.g. 'gpt2-large') and device (e.g. 'cuda') are set earlier in the full example
    model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
    tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
We'll load in the WikiText-2 dataset and evaluate the perplexity using a few different sliding-window strategies. Since
this dataset is small and we're just doing one forward pass over the set, we can just load and encode the entire
dataset in memory.
.. code-block:: python

    from datasets import load_dataset

    test = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
    encodings = tokenizer('\n\n'.join(test['text']), return_tensors='pt')
With 🤗 Transformers, we can simply pass the ``input_ids`` as the ``labels`` to our model, and the average
log-likelihood for each token is returned as the loss. With our sliding window approach, however, there is overlap in
the tokens we pass to the model at each iteration. We don't want the log-likelihood for the tokens we're just treating
as context to be included in our loss, so we can set these targets to ``-100`` so that they are ignored. The following
is an example of how we could do this with a stride of ``512``. This means that the model will have at least 512 tokens
for context when calculating the conditional likelihood of any one token (provided there are 512 preceding tokens
available to condition on).
.. code-block:: python

    # lls and end_loc are accumulated by the strided evaluation loop sketched below
    ppl = torch.exp(torch.stack(lls).sum() / end_loc)
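The ``lls`` list and ``end_loc`` value above are produced by a strided evaluation loop over the corpus. A minimal
sketch of such a loop, assuming the ``model``, ``encodings`` and ``device`` objects from the snippets above are in
scope (variable names and bookkeeping are illustrative), could look like this:

.. code-block:: python

    import torch

    max_length = model.config.n_positions  # 1024 for GPT-2
    stride = 512

    lls = []
    for i in range(0, encodings.input_ids.size(1), stride):
        begin_loc = max(i + stride - max_length, 0)
        end_loc = min(i + stride, encodings.input_ids.size(1))
        trg_len = end_loc - i  # may be shorter than stride on the last window

        input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100  # only score the last trg_len tokens of the window

        with torch.no_grad():
            # the loss is the average negative log-likelihood over the scored tokens;
            # rescale it to a (roughly) summed negative log-likelihood for this window
            outputs = model(input_ids, labels=target_ids)
            lls.append(outputs[0] * trg_len)

    ppl = torch.exp(torch.stack(lls).sum() / end_loc)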
Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window
strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction,
and the better the reported perplexity will typically be.
When we run the above with ``stride = 1024``, i.e. no overlap, the resulting PPL is ``19.64``, which is about the same
as the ``19.93`` reported in the GPT-2 paper. By using ``stride = 512`` and thereby employing our strided sliding-window
strategy, this jumps down to ``16.53``. This is not only a more favorable score, but is calculated in a way that is
closer to the true autoregressive decomposition of a sequence likelihood.
The library was designed with two strong goals in mind:

- Be as easy and fast to use as possible:
  - We strongly limited the number of user-facing abstractions to learn; in fact, there are almost no abstractions,
    just three standard classes required to use each model: :doc:`configuration <main_classes/configuration>`,
    :doc:`models <main_classes/model>` and :doc:`tokenizer <main_classes/tokenizer>`.
  - All of these classes can be initialized in a simple and unified way from pretrained instances by using a common
    :obj:`from_pretrained()` instantiation method, which will take care of downloading (if needed), caching and
    loading the related class instance and associated data (configurations' hyper-parameters, tokenizers' vocabulary,
    and models' weights) from a pretrained checkpoint provided on `Hugging Face Hub
    <https://huggingface.co/models>`__ or your own saved checkpoint.
  - On top of those three base classes, the library provides two APIs: :func:`~transformers.pipeline` for quickly
    using a model (plus its associated tokenizer and configuration) on a given task and
    :func:`~transformers.Trainer`/:func:`~transformers.TFTrainer` to quickly train or fine-tune a given model.
  - As a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to
    extend/build-upon the library, just use regular Python/PyTorch/TensorFlow/Keras modules and inherit from the base
Main concepts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The library is built around three types of classes for each model:
- **Model classes** such as :class:`~transformers.BertModel`, which are 30+ PyTorch models (`torch.nn.Module
  <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__) or Keras models (`tf.keras.Model
  <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__) that work with the pretrained weights provided in the
  library.
- **Configuration classes** such as :class:`~transformers.BertConfig`, which store all the parameters required to build
  a model. You don't always need to instantiate these yourself. In particular, if you are using a pretrained model
  without any modification, creating the model will automatically take care of instantiating the configuration (which
All these classes can be instantiated from pretrained instances and saved locally using two methods:
- :obj:`from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either
  provided by the library itself (the supported models are provided in the list :doc:`here <pretrained_models>`) or
  stored locally (or on a server) by the user,
- :obj:`save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using
  :obj:`from_pretrained()` (a short example is sketched below).
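As a minimal illustration of this round trip (the checkpoint name and save directory below are only examples):

.. code-block:: python

    from transformers import BertModel, BertTokenizer

    # Download (or load from the local cache) a pretrained model and its tokenizer
    model = BertModel.from_pretrained("bert-base-uncased")
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # Save both to a local directory...
    model.save_pretrained("./my-bert")
    tokenizer.save_pretrained("./my-bert")

    # ...and reload them later with the same from_pretrained() method
    model = BertModel.from_pretrained("./my-bert")
    tokenizer = BertTokenizer.from_pretrained("./my-bert")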