"...git@developer.sourcefind.cn:chenpangpang/open-webui.git" did not exist on "422159477809730c85fadd06ef9dd3cefb3deb32"
Unverified commit 08f534d2, authored by Sylvain Gugger and committed by GitHub

Doc styling (#8067)

* Important files

* Styling them all

* Revert "Styling them all"

This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.

* Syling them for realsies

* Fix syntax error

* Fix benchmark_utils

* More fixes

* Fix modeling auto and script

* Remove new line

* Fixes

* More fixes

* Fix more files

* Style

* Add FSMT

* More fixes

* More fixes

* More fixes

* More fixes

* Fixes

* More fixes

* More fixes

* Last fixes

* Make sphinx happy
parent 04a17f85

@@ -29,10 +29,10 @@ The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/ma

Implementation Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Bart doesn't use :obj:`token_type_ids` for sequence classification. Use :class:`~transformers.BartTokenizer` or
  :meth:`~transformers.BartTokenizer.encode` to get the proper splitting.
- The forward pass of :class:`~transformers.BartModel` will create decoder inputs (using the helper function
  :func:`transformers.modeling_bart._prepare_bart_decoder_inputs`) if they are not passed. This is different than some
  other modeling APIs (see the sketch below).
- Model predictions are intended to be identical to the original implementation. This only works, however, if the
  string you pass to :func:`fairseq.encode` starts with a space.
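
To make the second note concrete, here is a minimal sketch of calling :class:`~transformers.BartModel` without explicit
decoder inputs; the ``facebook/bart-large`` checkpoint and the sample sentence are used purely for illustration.

.. code-block::

    from transformers import BartTokenizer, BartModel

    # "facebook/bart-large" is an example checkpoint, not a requirement
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
    model = BartModel.from_pretrained("facebook/bart-large")

    inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
    # no decoder_input_ids are passed: the forward pass builds them internally
    outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
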

@@ -25,8 +25,8 @@ improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).*

Tips:

- BERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
  the left (see the example below).
- BERT was trained with the masked language modeling (MLM) and next sentence prediction (NSP) objectives. It is
  efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation.
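
A minimal sketch of the padding tip, assuming the ``bert-base-uncased`` checkpoint; the sample sentences are
placeholders.

.. code-block::

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    # the tokenizer pads on the right by default (tokenizer.padding_side == "right"),
    # which is what a model with absolute position embeddings expects
    batch = tokenizer(["A short sentence.", "A noticeably longer example sentence."],
                      padding=True, return_tensors="pt")
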

@@ -24,15 +24,15 @@ Usage:

- The model can be used in combination with the :class:`~transformers.EncoderDecoderModel` to leverage two pretrained
  BERT checkpoints for subsequent fine-tuning.

.. code-block::

    from transformers import BertGenerationEncoder, BertGenerationDecoder, EncoderDecoderModel, BertTokenizer

    # leverage checkpoints for Bert2Bert model...
    # use BERT's cls token as BOS token and sep token as EOS token
    encoder = BertGenerationEncoder.from_pretrained("bert-large-uncased", bos_token_id=101, eos_token_id=102)
    # add cross attention layers and use BERT's cls token as BOS token and sep token as EOS token
    decoder = BertGenerationDecoder.from_pretrained("bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102)
    bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)

    # create tokenizer...
    tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
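
One possible continuation of the snippet above, sketched here rather than taken from the original example: tokenize an
(article, summary) pair and run a single training step. The sample strings are placeholders, and the loss is read from
the first element of the returned tuple (assuming the default tuple output).

.. code-block::

    # hedged sketch of a single training step on a toy (article, summary) pair
    input_ids = tokenizer("This is a long article to summarize", add_special_tokens=False, return_tensors="pt")["input_ids"]
    labels = tokenizer("This is a short summary", return_tensors="pt")["input_ids"]

    outputs = bert2bert(input_ids=input_ids, decoder_input_ids=labels, labels=labels)
    loss = outputs[0]
    loss.backward()
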

Blenderbot
-----------------------------------------------------------------------------------------------------------------------

**DISCLAIMER:** If you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ .

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Blender chatbot model was proposed in `Recipes for building an open-domain chatbot
<https://arxiv.org/pdf/2004.13637.pdf>`__ by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan
Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.

The abstract of the paper is the following:

*Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that
scaling neural models in the number of parameters and the size of the data they are trained on gives improved results,
we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of
skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to
their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent
persona. We show that large scale models can learn these skills when given appropriate training data and choice of
generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models
and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn
dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
failure cases of our models.*

The authors' code can be found `here <https://github.com/facebookresearch/ParlAI>`__ .

@@ -20,8 +32,11 @@ Implementation Notes

- Blenderbot uses a standard `seq2seq model transformer <https://arxiv.org/pdf/1706.03762.pdf>`__ based architecture.
- It inherits completely from :class:`~transformers.BartForConditionalGeneration`.
- Even though blenderbot is one model, it uses two tokenizers: :class:`~transformers.BlenderbotSmallTokenizer` for the
  90M checkpoint and :class:`~transformers.BlenderbotTokenizer` for all other checkpoints (see the example below).
- :class:`~transformers.BlenderbotSmallTokenizer` will always return :class:`~transformers.BlenderbotSmallTokenizer`,
  regardless of checkpoint. To use the 3B parameter checkpoint, you must call
  :class:`~transformers.BlenderbotTokenizer` directly.
- Available checkpoints can be found in the `model hub <https://huggingface.co/models?search=blenderbot>`__.
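
A short sketch of the tokenizer split described above; the ``facebook/blenderbot-90M`` and ``facebook/blenderbot-3B``
checkpoint names are assumptions, so check the model hub for the exact identifiers.

.. code-block::

    from transformers import BlenderbotSmallTokenizer, BlenderbotTokenizer

    # hypothetical checkpoint names used for illustration
    small_tokenizer = BlenderbotSmallTokenizer.from_pretrained("facebook/blenderbot-90M")  # 90M checkpoint
    big_tokenizer = BlenderbotTokenizer.from_pretrained("facebook/blenderbot-3B")          # all other checkpoints
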
@@ -56,6 +71,7 @@ Here is how you can check out config values:

BlenderbotConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.BlenderbotConfig
    :members:

@@ -74,6 +90,7 @@ BlenderbotSmallTokenizer

BlenderbotForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See :obj:`transformers.BartForConditionalGeneration` for arguments to `forward` and `generate`

.. autoclass:: transformers.BlenderbotForConditionalGeneration

@@ -4,26 +4,26 @@ CamemBERT

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The CamemBERT model was proposed in `CamemBERT: a Tasty French Language Model <https://arxiv.org/abs/1911.03894>`__ by
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la
Clergerie, Djamé Seddah, and Benoît Sagot. It is based on Facebook's RoBERTa model released in 2019. It is a model
trained on 138GB of French text.

The abstract from the paper is the following:

*Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available
models have either been trained on English data or on the concatenation of data in multiple languages. This makes
practical use of such models --in all languages except English-- very limited. Aiming to address this issue for French,
we release CamemBERT, a French version of the Bi-directional Encoders for Transformers (BERT). We measure the
performance of CamemBERT compared to multilingual models in multiple downstream tasks, namely part-of-speech tagging,
dependency parsing, named-entity recognition, and natural language inference. CamemBERT improves the state of the art
for most of the tasks considered. We release the pretrained model for CamemBERT hoping to foster research and
downstream applications for French NLP.*

Tips:

- This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage examples
  as well as the information relative to the inputs and outputs.
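
Since the API mirrors RoBERTa, a minimal loading sketch looks as follows; the ``camembert-base`` checkpoint and the
sample sentence are used for illustration only.

.. code-block::

    from transformers import CamembertTokenizer, CamembertModel

    tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
    model = CamembertModel.from_pretrained("camembert-base")

    # encode a French sentence and run a forward pass
    inputs = tokenizer("J'aime le camembert !", return_tensors="pt")
    outputs = model(**inputs)
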
The original code can be found `here <https://camembert-model.fr/>`__.

@@ -130,4 +130,4 @@ TFCamembertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFCamembertForQuestionAnswering
    :members:

@@ -6,33 +6,33 @@ Overview

CTRL model was proposed in `CTRL: A Conditional Transformer Language Model for Controllable Generation
<https://arxiv.org/abs/1909.05858>`_ by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and
Richard Socher. It's a causal (unidirectional) transformer pre-trained using language modeling on a very large corpus
of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).

The abstract from the paper is the following:

*Large-scale language models show promising text generation capabilities, but users cannot easily control particular
aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model,
trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were
derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while
providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the
training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data
via model-based source attribution.*

Tips:

- CTRL makes use of control codes to generate text: it requires generations to be started by certain words, sentences
  or links to generate coherent text. Refer to the `original implementation <https://github.com/salesforce/ctrl>`__ for
  more information (see also the example below).
- CTRL is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
  the left.
- CTRL was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
  token in a sequence. Leveraging this feature allows CTRL to generate syntactically coherent text as it can be
  observed in the `run_generation.py` example script.
- The PyTorch models can take the `past` as input, which is the previously computed key/value attention pairs. Using
  this `past` value prevents the model from re-computing pre-computed values in the context of text generation. See
  `reusing the past in generative models <../quickstart.html#using-the-past>`__ for more information on the usage of
  this argument.
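
A hedged sketch combining the first and last tips: prepend a control code to the prompt and let ``generate()`` cache the
past key/value pairs internally. The ``ctrl`` checkpoint name, the ``Links`` control code and the generation arguments
are illustrative.

.. code-block::

    from transformers import CTRLTokenizer, CTRLLMHeadModel

    tokenizer = CTRLTokenizer.from_pretrained("ctrl")
    model = CTRLLMHeadModel.from_pretrained("ctrl")

    # generations should start with a control code such as "Links"
    input_ids = tokenizer("Links Hello, my dog is cute", return_tensors="pt")["input_ids"]
    generated = model.generate(input_ids, max_length=50)  # generation reuses `past` under the hood
    print(tokenizer.decode(generated[0]))
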
The original code can be found `here <https://github.com/salesforce/ctrl>`__.

DeBERTa
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The DeBERTa model was proposed in `DeBERTa: Decoding-enhanced BERT with Disentangled Attention
<https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It is based on Google's
BERT model released in 2018 and Facebook's RoBERTa model released in 2019. It builds on RoBERTa with disentangled
attention and enhanced mask decoder training with half of the data used in RoBERTa.

The abstract from the paper is the following:

*Recent progress in pre-trained neural language models has significantly improved the performance of many natural
language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with
disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the
disentangled attention mechanism, where each word is represented using two vectors that encode its content and
position, respectively, and the attention weights among words are computed using disentangled matrices on their
contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
of model pre-training and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half
of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
(90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*

The original code can be found `here <https://github.com/microsoft/DeBERTa>`__.
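
A minimal usage sketch; the ``microsoft/deberta-base`` checkpoint name is an assumption and the input sentence is a
placeholder.

.. code-block::

    from transformers import DebertaTokenizer, DebertaModel

    tokenizer = DebertaTokenizer.from_pretrained("microsoft/deberta-base")
    model = DebertaModel.from_pretrained("microsoft/deberta-base")

    inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
    outputs = model(**inputs)
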
DebertaConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaConfig
    :members:

DebertaTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,

@@ -42,21 +45,21 @@ DebertaTokenizer

DebertaModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaModel
    :members:

DebertaPreTrainedModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaPreTrainedModel
    :members:

DebertaForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaForSequenceClassification
    :members:

@@ -4,36 +4,39 @@ DialoGPT

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DialoGPT was proposed in `DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation
<https://arxiv.org/abs/1911.00536>`_ by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao,
Jianfeng Gao, Jingjing Liu, Bill Dolan. It's a GPT2 Model trained on 147M conversation-like exchanges extracted from
Reddit.

The abstract from the paper is the following:

*We present a large, tunable neural conversational response generation model, DialoGPT (dialogue generative pre-trained
transformer). Trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning
from 2005 through 2017, DialoGPT extends the Hugging Face PyTorch transformer to attain a performance close to human
both in terms of automatic and human evaluation in single-turn dialogue settings. We show that conversational systems
that leverage DialoGPT generate more relevant, contentful and context-consistent responses than strong baseline
systems. The pre-trained model and training pipeline are publicly released to facilitate research into neural response
generation and the development of more intelligent open-domain dialogue systems.*

Tips:

- DialoGPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
  than the left.
- DialoGPT was trained with a causal language modeling (CLM) objective on conversational data and is therefore powerful
  at response generation in open-domain dialogue systems.
- DialoGPT enables the user to create a chat bot in just 10 lines of code as shown on `DialoGPT's model card
  <https://huggingface.co/microsoft/DialoGPT-medium>`_.

Training:

In order to train or fine-tune DialoGPT, one can use causal language modeling training. To cite the official paper: *We
follow the OpenAI GPT-2 to model a multiturn dialogue session as a long text and frame the generation task as language
modeling. We first concatenate all dialog turns within a dialogue session into a long text x_1,..., x_N (N is the
sequence length), ended by the end-of-text token.* For more information please refer to the original paper.
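
In practice that framing boils down to joining the turns of a dialogue session with the end-of-text token and training
with the usual causal language modeling loss. A hedged sketch, with placeholder turns and the
``microsoft/DialoGPT-medium`` checkpoint mentioned above:

.. code-block::

    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
    model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

    # concatenate all turns of one dialogue session into a single long text,
    # each turn ended by the end-of-text token
    turns = ["Does money buy happiness?", "Depends how much money you spend on it."]
    training_text = "".join(turn + tokenizer.eos_token for turn in turns)

    inputs = tokenizer(training_text, return_tensors="pt")
    # standard causal LM step: the labels are the inputs themselves
    outputs = model(**inputs, labels=inputs["input_ids"])
    loss = outputs[0]
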
DialoGPT's architecture is based on the GPT2 model, so one can refer to GPT2's `docstring
<https://huggingface.co/transformers/model_doc/gpt2.html>`_.

The original code can be found `here <https://github.com/microsoft/DialoGPT>`_.

@@ -4,13 +4,12 @@ DistilBERT

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The DistilBERT model was proposed in the blog post `Smaller, faster, cheaper, lighter: Introducing DistilBERT, a
distilled version of BERT <https://medium.com/huggingface/distilbert-8cf3380435b5>`__, and the paper `DistilBERT, a
distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__. DistilBERT is a
small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% less parameters than
`bert-base-uncased`, runs 60% faster while preserving over 95% of BERT's performances as measured on the GLUE language
understanding benchmark.

The abstract from the paper is the following:

@@ -18,13 +17,13 @@ The abstract from the paper is the following:
operating these large models in on-the-edge and/or under constrained computational training or inference budgets
remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation
model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger
counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage
knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by
40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive
biases learned by larger models during pre-training, we introduce a triple loss combining language modeling,
distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we
demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device
study.*

Tips:

@@ -33,7 +32,8 @@ Tips:
- DistilBERT doesn't have options to select the input positions (:obj:`position_ids` input). This could be added if
  necessary though, just let us know if you need this option.

The original code can be found `here
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__.

DistilBertConfig

@@ -4,9 +4,9 @@ DPR

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. It was
introduced in `Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`__ by
Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih.

The abstract from the paper is the following:

@@ -12,34 +12,28 @@ identify which tokens were replaced by the generator in the sequence.

The abstract from the paper is the following:

*Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with
[MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to
downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a
more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach
corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead
of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that
predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments
demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens
rather than just the small subset that was masked out. As a result, the contextual representations learned by our
approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are
particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained
using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale,
where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when
using the same amount of compute.*

Tips:

- ELECTRA is the pretraining approach, therefore there are nearly no changes done to the underlying model: BERT. The
  only change is the separation of the embedding size and the hidden size: the embedding size is generally smaller,
  while the hidden size is larger. An additional projection layer (linear) is used to project the embeddings from their
  embedding size to the hidden size. In the case where the embedding size is the same as the hidden size, no projection
  layer is used.
- The ELECTRA checkpoints saved using `Google Research's implementation <https://github.com/google-research/electra>`__
  contain both the generator and discriminator. The conversion script requires the user to name which model to export
  into the correct architecture. Once converted to the HuggingFace format, these checkpoints may be loaded into all

@@ -13,7 +13,7 @@ any other models (see the examples for more information).

An application of this architecture could be to leverage two pretrained :class:`~transformers.BertModel` as the encoder
and decoder for a summarization model as was shown in: `Text Summarization with Pretrained Encoders
<https://arxiv.org/abs/1908.08345>`__ by Yang Liu and Mirella Lapata.
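
For instance, such a warm-started BERT-to-BERT model can be instantiated roughly as follows; ``bert-base-uncased`` is
used purely as an example checkpoint.

.. code-block::

    from transformers import EncoderDecoderModel

    # warm-start both encoder and decoder from pretrained BERT weights;
    # the decoder's cross-attention layers are randomly initialized
    model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
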
EncoderDecoderConfig

@@ -11,17 +11,17 @@ modeling (MLM) objective (like BERT).

The abstract from the paper is the following:

*Language models have become a key step to achieve state-of-the art results in many different Natural Language
Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way
to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their
contextualization at the sentence level. This has been widely demonstrated for English using contextualized
representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al.,
2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and
heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for
Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text
classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the
time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified evaluation
protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research
community for further reproducible experiments in French NLP.*

The original code can be found `here <https://github.com/getalp/Flaubert>`__.

@@ -58,4 +58,4 @@ FSMTForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.FSMTForConditionalGeneration
    :members: forward

@@ -30,8 +30,8 @@ Tips:

  directly for tasks that just require a sentence summary (like sequence classification or multiple choice). For other
  tasks, the full model is used; this full model has a decoder that upsamples the final hidden states to the same
  sequence length as the input.
- The Funnel Transformer checkpoints are all available with a full version and a base version. The first ones should be
  used for :class:`~transformers.FunnelModel`, :class:`~transformers.FunnelForPreTraining`,
  :class:`~transformers.FunnelForMaskedLM`, :class:`~transformers.FunnelForTokenClassification` and
  :class:`~transformers.FunnelForQuestionAnswering`. The second ones should be used for
  :class:`~transformers.FunnelBaseModel`, :class:`~transformers.FunnelForSequenceClassification` and

@@ -6,44 +6,39 @@ Overview

OpenAI GPT model was proposed in `Improving Language Understanding by Generative Pre-Training
<https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf>`__
by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It's a causal (unidirectional) transformer
pre-trained using language modeling on a large corpus with long range dependencies, the Toronto Book Corpus.

The abstract from the paper is the following:

*Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering,
semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant,
labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to
perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a
language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In
contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve
effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our
approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms
discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon
the state of the art in 9 out of the 12 tasks studied.*

Tips:

- GPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
  the left.
- GPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
  token in a sequence. Leveraging this feature allows GPT to generate syntactically coherent text as it can be
  observed in the `run_generation.py` example script.
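
A minimal generation sketch in the spirit of `run_generation.py`; the prompt and the sampling arguments are illustrative
only.

.. code-block::

    from transformers import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel

    tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
    model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")

    input_ids = tokenizer("my dog is cute and", return_tensors="pt")["input_ids"]
    # sample a continuation from the causal LM
    generated = model.generate(input_ids, do_sample=True, max_length=40)
    print(tokenizer.decode(generated[0]))
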
`Write With Transformer <https://transformer.huggingface.co/doc/gpt>`__ is a webapp created and hosted by Hugging Face
showcasing the generative capabilities of several models. GPT is one of them.

The original code can be found `here <https://github.com/openai/finetune-transformer-lm>`__.

Note:

If you want to reproduce the original tokenization process of the `OpenAI GPT` paper, you will need to install ``ftfy``
and ``SpaCy``::

.. code-block:: bash

@@ -51,8 +46,7 @@ If you want to reproduce the original tokenization process of the `OpenAI GPT` p
    python -m spacy download en

If you don't install ``ftfy`` and ``SpaCy``, the :class:`~transformers.OpenAIGPTTokenizer` will default to tokenize
using BERT's :obj:`BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).

OpenAIGPTConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...@@ -5,29 +5,29 @@ Overview ...@@ -5,29 +5,29 @@ Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
OpenAI GPT-2 model was proposed in `Language Models are Unsupervised Multitask Learners OpenAI GPT-2 model was proposed in `Language Models are Unsupervised Multitask Learners
<https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_ <https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_ by Alec
by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. It's a causal (unidirectional) Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. It's a causal (unidirectional)
transformer pretrained using language modeling on a very large corpus of ~40 GB of text data. transformer pretrained using language modeling on a very large corpus of ~40 GB of text data.
The abstract from the paper is the following: The abstract from the paper is the following:
*GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset[1] *GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset[1] of 8 million
of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some
words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks
demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than
the parameters and trained on more than 10X the amount of data.* 10X the amount of data.*

Tips:

- GPT-2 is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
  the left.
- GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
  token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text, as can be observed
  in the `run_generation.py` example script.
- The PyTorch models can take the `past` as input, which is the previously computed key/value attention pairs. Using
  this `past` value prevents the model from re-computing pre-computed values in the context of text generation. See
  `reusing the past in generative models <../quickstart.html#using-the-past>`__ for more information on the usage of
  this argument (a minimal sketch follows this list).
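
For illustration, here is a minimal sketch of greedy decoding that reuses the `past` cache. The argument name and the
ordering of the returned tensors are assumptions based on the current PyTorch API and may differ between versions:

.. code-block:: python

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    input_ids = tokenizer.encode("The Manhattan Bridge is", return_tensors="pt")
    generated = input_ids
    past = None

    with torch.no_grad():
        for _ in range(20):
            # Once a cache is available, only the newly generated token has to be fed back in.
            outputs = model(input_ids, past=past)
            logits, past = outputs[0], outputs[1]
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            generated = torch.cat([generated, next_token], dim=-1)
            input_ids = next_token

    print(tokenizer.decode(generated[0]))
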
`Write With Transformer <https://transformer.huggingface.co/doc/gpt2-large>`__ is a webapp created and hosted by
Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five
LayoutLM
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The LayoutLM model was proposed in the paper `LayoutLM: Pre-training of Text and Layout for Document Image
Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and
Ming Zhou. It's a simple but effective pre-training method of text and layout for document image understanding and
information extraction tasks, such as form understanding and receipt understanding.

The abstract from the paper is the following:

*Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the
widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation,
while neglecting layout and style information that is vital for document image understanding. In this paper, we propose
the \textbf{LayoutLM} to jointly model interactions between text and layout information across scanned document images,
which is beneficial for a great number of real-world document image understanding tasks such as information extraction
from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into
LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single
framework for document-level pre-training. It achieves new state-of-the-art results in several downstream tasks,
including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image
classification (from 93.07 to 94.42).*

Tips:

- LayoutLM has an extra input called :obj:`bbox`, which are the bounding boxes of the input tokens.
- The :obj:`bbox` coordinates must be on a 0-1000 scale, which means you should normalize the bounding boxes before
  passing them into the model (see the sketch after this list).
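
A minimal sketch of such a normalization, assuming boxes are given as ``(x0, y0, x1, y1)`` in absolute pixel
coordinates of the page image (the helper name is only illustrative):

.. code-block:: python

    def normalize_bbox(bbox, page_width, page_height):
        # Rescale an (x0, y0, x1, y1) pixel box to the 0-1000 range expected by LayoutLM.
        x0, y0, x1, y1 = bbox
        return [
            int(1000 * x0 / page_width),
            int(1000 * y0 / page_height),
            int(1000 * x1 / page_width),
            int(1000 * y1 / page_height),
        ]
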
The original code can be found `here <https://github.com/microsoft/unilm/tree/master/layoutlm>`_.

LayoutLMConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMConfig
    :members:

LayoutLMTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMTokenizer
    :members:

LayoutLMModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMModel
    :members:

LayoutLMForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMForMaskedLM
    :members:

LayoutLMForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMForTokenClassification
    :members:
...@@ -27,20 +27,20 @@ The Authors' code can be found `here <https://github.com/allenai/longformer>`__.

Longformer Self Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Longformer self attention employs self attention on both a "local" context and a "global" context. Most tokens only
attend "locally" to each other, meaning that each token attends to its :math:`\frac{1}{2} w` previous tokens and
:math:`\frac{1}{2} w` succeeding tokens, with :math:`w` being the window length as defined in
:obj:`config.attention_window`. Note that :obj:`config.attention_window` can be of type :obj:`List` to define a
different :math:`w` for each layer. A selected few tokens attend "globally" to all other tokens, as is conventionally
done for all tokens in :obj:`BertSelfAttention`.
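
For instance, a sketch of setting one window size per layer through the configuration (the values below are only
illustrative and assume a 12-layer model):

.. code-block:: python

    from transformers import LongformerConfig

    # A single int applies the same window to every layer; a list gives one window per layer.
    config = LongformerConfig(attention_window=[32, 32, 64, 64, 128, 128, 256, 256, 512, 512, 512, 512])
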
Note that "locally" and "globally" attending tokens are projected by different query, key and value matrices. Also note
that every "locally" attending token not only attends to tokens within its window :math:`w`, but also to all "globally"
attending tokens so that global attention is *symmetric*.

The user can define which tokens attend "locally" and which tokens attend "globally" by setting the tensor
:obj:`global_attention_mask` at run-time appropriately. All Longformer models employ the following logic for
:obj:`global_attention_mask` (a usage sketch follows this list):

- 0: the token attends "locally",
- 1: the token attends "globally".
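
A minimal sketch that gives the first token global attention and every other token local attention (the checkpoint
name is only an example):

.. code-block:: python

    import torch
    from transformers import LongformerModel, LongformerTokenizer

    tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
    model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

    input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="pt")

    # 0 -> local attention, 1 -> global attention
    global_attention_mask = torch.zeros(input_ids.shape, dtype=torch.long)
    global_attention_mask[:, 0] = 1  # global attention on the first (<s>) token

    outputs = model(input_ids, global_attention_mask=global_attention_mask)
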
...@@ -8,9 +8,8 @@

The LXMERT model was proposed in `LXMERT: Learning Cross-Modality Encoder Representations from Transformers
<https://arxiv.org/abs/1908.07490>`__ by Hao Tan & Mohit Bansal. It is a series of bidirectional transformer encoders
(one for the vision modality, one for the language modality, and then one to fuse both modalities) pretrained using a
combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked
visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives. The pretraining
consists of multiple multi-modal datasets: MSCOCO, Visual-Genome + Visual-Genome Question Answering, VQA 2.0, and GQA.

The abstract from the paper is the following: