"docs/vscode:/vscode.git/clone" did not exist on "4eec5d0cf67116e98770c305640b5710571da4f6"
Unverified commit 5964f820, authored by Maria Khalusova and committed by GitHub

[Docs] Model_doc structure/clarity improvements (#26876)

* first batch of structure improvements for model_docs

* second batch of structure improvements for model_docs

* more structure improvements for model_docs

* more structure improvements for model_docs

* structure improvements for cv model_docs

* more structural refactoring

* addressed feedback about image processors
...@@ -44,7 +44,12 @@ approach on a wide range of benchmarks for natural language understanding. Our g
discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon
the state of the art in 9 out of the 12 tasks studied.*
Tips:
[Write With Transformer](https://transformer.huggingface.co/doc/gpt) is a webapp created and hosted by Hugging Face
showcasing the generative capabilities of several models. GPT is one of them.
This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/openai/finetune-transformer-lm).
## Usage tips
- GPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the left (see the padding sketch after these tips).
...@@ -52,10 +57,6 @@ Tips:
token in a sequence. Leveraging this feature allows GPT to generate syntactically coherent text as can be
observed in the *run_generation.py* example script.
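As a quick illustration of the right-padding tip above, here is a minimal sketch. The `openai-gpt` checkpoint ships without a padding token, so the unknown token is reused purely for padding; that pairing is an illustrative choice, not something prescribed by the original docs:

```python
from transformers import AutoTokenizer

# right-padding keeps absolute position ids aligned with the real tokens at the start of each sequence
tokenizer = AutoTokenizer.from_pretrained("openai-gpt", padding_side="right")
tokenizer.pad_token = tokenizer.unk_token  # openai-gpt has no dedicated pad token

batch = tokenizer(
    ["a short prompt", "a noticeably longer prompt that forces padding"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)    # both sequences padded to the same length
print(batch["attention_mask"][0])  # trailing zeros mark the right-side padding
```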
[Write With Transformer](https://transformer.huggingface.co/doc/gpt) is a webapp created and hosted by Hugging Face
showcasing the generative capabilities of several models. GPT is one of them.
This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/openai/finetune-transformer-lm).
Note:
...@@ -116,6 +117,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] models.openai.modeling_tf_openai.TFOpenAIGPTDoubleHeadsModelOutput
<frameworkcontent>
<pt>
## OpenAIGPTModel
[[autodoc]] OpenAIGPTModel
...@@ -136,6 +140,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] OpenAIGPTForSequenceClassification
- forward
</pt>
<tf>
## TFOpenAIGPTModel
[[autodoc]] TFOpenAIGPTModel
...@@ -155,3 +162,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] TFOpenAIGPTForSequenceClassification
- call
</tf>
</frameworkcontent>
...@@ -25,13 +25,13 @@ The abstract from the paper is the following:
*Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.*
This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ), [Younes Belkada](https://huggingface.co/ybelkada), and [Patrick Von Platen](https://huggingface.co/patrickvonplaten).
The original code can be found [here](https://github.com/facebookresearch/metaseq).
Tips:
- OPT has the same architecture as [`BartDecoder`].
- Contrary to GPT2, OPT adds the EOS token `</s>` to the beginning of every prompt.
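A small sketch of the second tip, assuming the `facebook/opt-350m` checkpoint (any OPT checkpoint behaves the same way):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

ids = tokenizer("Hello world").input_ids
# the tokenizer prepends </s> to every prompt, so the first id is the EOS id
print(ids[0] == tokenizer.convert_tokens_to_ids("</s>"))  # True
print(tokenizer.decode(ids))                              # "</s>Hello world"
```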
This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ), [Younes Belkada](https://huggingface.co/ybelkada), and [Patrick Von Platen](https://huggingface.co/patrickvonplaten).
The original code can be found [here](https://github.com/facebookresearch/metaseq).
## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with OPT. If you're
...@@ -66,6 +66,9 @@ The resource should ideally demonstrate something new instead of duplicating an
[[autodoc]] OPTConfig
<frameworkcontent>
<pt>
## OPTModel
[[autodoc]] OPTModel
...@@ -76,6 +79,19 @@ The resource should ideally demonstrate something new instead of duplicating an
[[autodoc]] OPTForCausalLM
- forward
## OPTForSequenceClassification
[[autodoc]] OPTForSequenceClassification
- forward
## OPTForQuestionAnswering
[[autodoc]] OPTForQuestionAnswering
- forward
</pt>
<tf>
## TFOPTModel
[[autodoc]] TFOPTModel
...@@ -86,23 +102,18 @@ The resource should ideally demonstrate something new instead of duplicating an
[[autodoc]] TFOPTForCausalLM
- call
## OPTForSequenceClassification
</tf>
<jax>
[[autodoc]] OPTForSequenceClassification
- forward
## OPTForQuestionAnswering
[[autodoc]] OPTForQuestionAnswering
- forward
## FlaxOPTModel
[[autodoc]] FlaxOPTModel
- __call__
## FlaxOPTForCausalLM
[[autodoc]] FlaxOPTForCausalLM
- __call__
</jax>
</frameworkcontent>
...@@ -24,11 +24,6 @@ The abstract from the paper is the following:
*Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding further large improvement: With an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.*
Tips:
- The architecture of OWLv2 is identical to [OWL-ViT](owlvit), however the object detection head now also includes an objectness classifier, which predicts the (query-agnostic) likelihood that a predicted box contains an object (as opposed to background). The objectness score can be used to rank or filter predictions independently of text queries.
- Usage of OWLv2 is identical to [OWL-ViT](owlvit) with a new, updated image processor ([`Owlv2ImageProcessor`]).
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/owlv2_overview.png" <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/owlv2_overview.png"
alt="drawing" width="600"/> alt="drawing" width="600"/>
...@@ -37,13 +32,12 @@ alt="drawing" width="600"/> ...@@ -37,13 +32,12 @@ alt="drawing" width="600"/>
This model was contributed by [nielsr](https://huggingface.co/nielsr). This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit). The original code can be found [here](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit).
## Usage ## Usage example
OWLv2 is, just like its predecessor [OWL-ViT](owlvit), a zero-shot text-conditioned object detection model. OWL-ViT uses [CLIP](clip) as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection.
[`Owlv2ImageProcessor`] can be used to resize (or rescale) and normalize images for the model and [`CLIPTokenizer`] is used to encode the text. [`Owlv2Processor`] wraps [`Owlv2ImageProcessor`] and [`CLIPTokenizer`] into a single instance to both encode the text and prepare the images. The following example shows how to perform object detection using [`Owlv2Processor`] and [`Owlv2ForObjectDetection`].
```python
>>> import requests
>>> from PIL import Image
...@@ -76,7 +70,15 @@ Detected a photo of a cat with confidence 0.665 at location [6.75, 38.97, 326.62
## Resources
A demo notebook on using OWLv2 for zero- and one-shot (image-guided) object detection can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/OWLv2).
- A demo notebook on using OWLv2 for zero- and one-shot (image-guided) object detection can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/OWLv2).
- [Zero-shot object detection task guide](../tasks/zero_shot_object_detection)
<Tip>
The architecture of OWLv2 is identical to [OWL-ViT](owlvit); however, the object detection head now also includes an objectness classifier, which predicts the (query-agnostic) likelihood that a predicted box contains an object (as opposed to background). The objectness score can be used to rank or filter predictions independently of text queries.
Usage of OWLv2 is identical to [OWL-ViT](owlvit) with a new, updated image processor ([`Owlv2ImageProcessor`]).
</Tip>
## Owlv2Config
...
...@@ -31,13 +31,12 @@ alt="drawing" width="600"/>
This model was contributed by [adirik](https://huggingface.co/adirik). The original code can be found [here](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit).
## Usage
## Usage tips
OWL-ViT is a zero-shot text-conditioned object detection model. OWL-ViT uses [CLIP](clip) as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection.
[`OwlViTImageProcessor`] can be used to resize (or rescale) and normalize images for the model and [`CLIPTokenizer`] is used to encode the text. [`OwlViTProcessor`] wraps [`OwlViTImageProcessor`] and [`CLIPTokenizer`] into a single instance to both encode the text and prepare the images. The following example shows how to perform object detection using [`OwlViTProcessor`] and [`OwlViTForObjectDetection`].
```python
>>> import requests
>>> from PIL import Image
...
...@@ -25,9 +25,6 @@ rendered properly in your Markdown viewer.
</a>
</div>
**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title)
and assign @patrickvonplaten.
## Overview
...@@ -42,13 +39,17 @@ According to the abstract,
This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The Authors' code can be found [here](https://github.com/google-research/pegasus).
Tips:
## Usage tips
- Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pretraining objective, called Gap Sentence Generation (GSG).
  * MLM: encoder input tokens are randomly replaced by mask tokens and have to be predicted by the encoder (like in BERT)
  * GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, which has a causal mask to hide future words, like a regular auto-regressive transformer decoder.
- FP16 is not supported (help/ideas on this appreciated!).
- The adafactor optimizer is recommended for pegasus fine-tuning.
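A minimal sketch of the Adafactor setup for Pegasus fine-tuning, using the optimizer shipped with 🤗 Transformers (the `google/pegasus-xsum` checkpoint and the hyperparameters below are illustrative, not prescribed by the authors):

```python
from transformers import PegasusForConditionalGeneration
from transformers.optimization import Adafactor, AdafactorSchedule

model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

# relative-step Adafactor: the learning rate is derived internally, so lr stays None
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,
)
lr_scheduler = AdafactorSchedule(optimizer)
```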
## Checkpoints
All the [checkpoints](https://huggingface.co/models?search=pegasus) are fine-tuned for summarization, besides
...@@ -60,20 +61,11 @@ All the [checkpoints](https://huggingface.co/models?search=pegasus) are fine-tun
- Full replication results and correctly pre-processed data can be found in this [Issue](https://github.com/huggingface/transformers/issues/6844#issue-689259666).
- [Distilled checkpoints](https://huggingface.co/models?search=distill-pegasus) are described in this [paper](https://arxiv.org/abs/2010.13002).
### Examples
- [Script](https://github.com/huggingface/transformers/tree/main/examples/research_projects/seq2seq-distillation/finetune_pegasus_xsum.sh) to fine-tune pegasus
on the XSUM dataset. Data download instructions at [examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization/README.md).
- FP16 is not supported (help/ideas on this appreciated!).
- The adafactor optimizer is recommended for pegasus fine-tuning.
## Implementation Notes
- All models are transformer encoder-decoders with 16 layers in each component.
- The implementation is completely inherited from [`BartForConditionalGeneration`]
- Some key configuration differences:
  - static, sinusoidal position embeddings
  - the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix.
  - more beams are used (`num_beams=8`)
...@@ -82,7 +74,6 @@ All the [checkpoints](https://huggingface.co/models?search=pegasus) are fine-tun
- The code to convert checkpoints trained in the author's [repo](https://github.com/google-research/pegasus) can be
found in `convert_pegasus_tf_to_pytorch.py`.
## Usage Example
```python
...@@ -106,8 +97,10 @@ All the [checkpoints](https://huggingface.co/models?search=pegasus) are fine-tun
... )
```
## Documentation resources
## Resources
- [Script](https://github.com/huggingface/transformers/tree/main/examples/research_projects/seq2seq-distillation/finetune_pegasus_xsum.sh) to fine-tune pegasus
on the XSUM dataset. Data download instructions at [examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization/README.md).
- [Causal language modeling task guide](../tasks/language_modeling)
- [Translation task guide](../tasks/translation)
- [Summarization task guide](../tasks/summarization)
...@@ -126,6 +119,9 @@ warning: `add_tokens` does not work at the moment.
[[autodoc]] PegasusTokenizerFast
<frameworkcontent>
<pt>
## PegasusModel
[[autodoc]] PegasusModel
...@@ -141,6 +137,9 @@ warning: `add_tokens` does not work at the moment.
[[autodoc]] PegasusForCausalLM
- forward
</pt>
<tf>
## TFPegasusModel
[[autodoc]] TFPegasusModel
...@@ -151,6 +150,9 @@ warning: `add_tokens` does not work at the moment.
[[autodoc]] TFPegasusForConditionalGeneration
- call
</tf>
<jax>
## FlaxPegasusModel
[[autodoc]] FlaxPegasusModel
...@@ -164,3 +166,6 @@ warning: `add_tokens` does not work at the moment.
- __call__
- encode
- decode
</jax>
</frameworkcontent>
...@@ -26,10 +26,6 @@ The abstract from the paper is the following:
*While large pretrained Transformer models have proven highly capable at tackling natural language tasks, handling long sequence inputs continues to be a significant challenge. One such task is long input summarization, where inputs are longer than the maximum input context of most pretrained models. Through an extensive set of experiments, we investigate what model architectural changes and pretraining paradigms can most efficiently adapt a pretrained Transformer for long input summarization. We find that a staggered, block-local Transformer with global encoder tokens strikes a good balance of performance and efficiency, and that an additional pretraining phase on long sequences meaningfully improves downstream summarization performance. Based on our findings, we introduce PEGASUS-X, an extension of the PEGASUS model with additional long input pretraining to handle inputs of up to 16K tokens. PEGASUS-X achieves strong performance on long input summarization tasks comparable with much larger models while adding few additional parameters and not requiring model parallelism to train.*
Tips:
* PEGASUS-X uses the same tokenizer as PEGASUS.
This model was contributed by [zphang](https://huggingface.co/zphang). The original code can be found [here](https://github.com/google-research/pegasus).
## Documentation resources
...@@ -37,17 +33,21 @@ This model was contributed by [zphang](<https://huggingface.co/zphang). The orig
- [Translation task guide](../tasks/translation)
- [Summarization task guide](../tasks/summarization)
<Tip>
PEGASUS-X uses the same tokenizer as [PEGASUS](pegasus).
</Tip>
## PegasusXConfig
[[autodoc]] PegasusXConfig
## PegasusXModel
[[autodoc]] PegasusXModel
- forward
## PegasusXForConditionalGeneration
[[autodoc]] PegasusXForConditionalGeneration
...
...@@ -81,7 +81,13 @@ alt="drawing" width="600"/>
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found
[here](https://github.com/deepmind/deepmind-research/tree/master/perceiver).
Tips:
<Tip warning={true}>
Perceiver does **not** work with `torch.nn.DataParallel` due to a bug in PyTorch, see [issue #36035](https://github.com/pytorch/pytorch/issues/36035)
</Tip>
## Resources
- The quickest way to get started with the Perceiver is by checking the [tutorial
notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Perceiver).
...@@ -89,13 +95,6 @@ Tips:
is implemented in the library. Note that the models available in the library only showcase some examples of what you can do
with the Perceiver. There are many more use cases, including question answering, named-entity recognition, object detection,
audio classification, video classification, etc.
**Note**:
- Perceiver does **not** work with `torch.nn.DataParallel` due to a bug in PyTorch, see [issue #36035](https://github.com/pytorch/pytorch/issues/36035)
## Documentation resources
- [Text classification task guide](../tasks/sequence_classification)
- [Masked language modeling task guide](../tasks/masked_language_modeling)
- [Image classification task guide](../tasks/image_classification)
...
...@@ -26,6 +26,10 @@ The authors showcase their approach to model evaluation, focusing on practical t
In terms of model details, the work outlines the architecture and training methodology of Persimmon-8B, providing insights into its design choices, sequence length, and dataset composition. The authors present a fast inference code that outperforms traditional implementations through operator fusion and CUDA graph utilization while maintaining code coherence. They express their anticipation of how the community will leverage this contribution to drive innovation, hinting at further upcoming releases as part of an ongoing series of developments.
This model was contributed by [ArthurZ](https://huggingface.co/ArthurZ).
The original code can be found [here](https://github.com/persimmon-ai-labs/adept-inference).
## Usage tips
<Tip warning={true}>
...@@ -67,8 +71,6 @@ model = PersimmonForCausalLM.from_pretrained("/output/path")
tokenizer = PersimmonTokenizer.from_pretrained("/output/path")
```
This model was contributed by [ArthurZ](https://huggingface.co/ArthurZ).
The original code can be found [here](https://github.com/persimmon-ai-labs/adept-inference).
- Persimmon uses a `sentencepiece` based tokenizer, with a `Unigram` model. It supports bytefallback, which is only available in `tokenizers==0.14.0` for the fast tokenizer.
The `LlamaTokenizer` is used as it is a standard wrapper around sentencepiece. The `chat` template will be updated with the templating functions in a follow-up PR!
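For reference, a short generation sketch; it assumes the `adept/persimmon-8b-base` checkpoint on the Hub and uses the Auto classes rather than any Persimmon-specific tokenizer class:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "adept/persimmon-8b-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```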
...
...@@ -28,7 +28,9 @@ best pre-trained multilingual model XLM-R (Conneau et al., 2020) and improves th
Vietnamese-specific NLP tasks including Part-of-speech tagging, Dependency parsing, Named-entity recognition and
Natural language inference.*
Example of use:
This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/PhoBERT).
## Usage example
```python
>>> import torch
...@@ -50,7 +52,12 @@ Example of use:
>>> # phobert = TFAutoModel.from_pretrained("vinai/phobert-base")
```
This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/PhoBERT).
<Tip>
PhoBERT's implementation is the same as BERT, except for tokenization. Refer to the [BERT documentation](bert) for information on
configuration classes and their parameters. The PhoBERT-specific tokenizer is documented below.
</Tip>
## PhobertTokenizer
...
...@@ -39,7 +39,6 @@ The original code can be found [here](https://github.com/google-research/pix2str
- [Fine-tuning Notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb)
- [All models](https://huggingface.co/models?search=pix2struct)
## Pix2StructConfig
[[autodoc]] Pix2StructConfig
...
...@@ -16,10 +16,7 @@ rendered properly in your Markdown viewer.
# PLBart
**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
## Overview
[@gchhablani](https://www.github.com/gchhablani).
## Overview of PLBart
The PLBART model was proposed in [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang.
This is a BART-like model which can be used to perform code-summarization, code-generation, and code-translation tasks. The pre-trained model `plbart-base` has been trained using multilingual denoising task
...@@ -40,7 +37,7 @@ even with limited annotations.*
This model was contributed by [gchhablani](https://huggingface.co/gchhablani). The Authors' code can be found [here](https://github.com/wasiahmad/PLBART).
### Training of PLBart
## Usage examples
PLBart is a multilingual encoder-decoder (sequence-to-sequence) model primarily intended for code-to-text, text-to-code, code-to-code tasks. As the
model is multilingual it expects the sequences in a different format. A special language id token is added in both the
...@@ -53,7 +50,7 @@ In cases where the language code is needed, the regular [`~PLBartTokenizer.__cal
when you pass texts as the first argument or with the keyword argument `text`, and will encode target text format if
it's passed with the `text_target` keyword argument.
- Supervised training
### Supervised training
```python
>>> from transformers import PLBartForConditionalGeneration, PLBartTokenizer
...@@ -65,7 +62,7 @@ it's passed with the `text_target` keyword argument.
>>> model(**inputs)
```
- Generation
### Generation
While generating the target text set the `decoder_start_token_id` to the target language id. The following
example shows how to translate Python to English using the `uclanlp/plbart-python-en_XX` model.
...@@ -82,7 +79,7 @@ it's passed with the `text_target` keyword argument.
"Returns the maximum value of a b c."
```
## Documentation resources
## Resources
- [Text classification task guide](../tasks/sequence_classification)
- [Causal language modeling task guide](../tasks/language_modeling)
...
...@@ -28,8 +28,9 @@ The figure below illustrates the architecture of PoolFormer. Taken from the [ori
<img width="600" src="https://user-images.githubusercontent.com/15921929/142746124-1ab7635d-2536-4a0e-ad43-b4fe2c5a525d.png"/>
This model was contributed by [heytanay](https://huggingface.co/heytanay). The original code can be found [here](https://github.com/sail-sg/poolformer).
Tips:
## Usage tips
- PoolFormer has a hierarchical architecture, where instead of Attention, a simple Average Pooling layer is present. All checkpoints of the model can be found on the [hub](https://huggingface.co/models?other=poolformer).
- One can use [`PoolFormerImageProcessor`] to prepare images for the model.
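A short sketch of the image-processor tip above, assuming the `sail/poolformer_s12` checkpoint (an ImageNet-1k classifier):

```python
import requests
from PIL import Image
from transformers import PoolFormerForImageClassification, PoolFormerImageProcessor

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = PoolFormerImageProcessor.from_pretrained("sail/poolformer_s12")
model = PoolFormerForImageClassification.from_pretrained("sail/poolformer_s12")

# the image processor handles resizing, rescaling and normalization
inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```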
...@@ -43,8 +44,6 @@ Tips:
| m36 | [6, 6, 18, 6] | [96, 192, 384, 768] | 56 | 82.1 |
| m48 | [8, 8, 24, 8] | [96, 192, 384, 768] | 73 | 82.5 |
This model was contributed by [heytanay](https://huggingface.co/heytanay). The original code can be found [here](https://github.com/sail-sg/poolformer).
## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with PoolFormer.
...
...@@ -32,7 +32,6 @@ is transformed to its waveform and passed to the encoder, which transforms it to
uses these latent representations to generate token ids in an autoregressive way. Each token id corresponds to one of four
different token types: time, velocity, note and 'special'. The token ids are then decoded to their equivalent MIDI file.
The abstract from the paper is the following:
*Piano covers of pop music are enjoyed by many people. However, the
...@@ -49,22 +48,21 @@ directly from pop audio without using melody and chord extraction ...@@ -49,22 +48,21 @@ directly from pop audio without using melody and chord extraction
modules. We show that Pop2Piano, trained with our dataset, is capable modules. We show that Pop2Piano, trained with our dataset, is capable
of producing plausible piano covers.* of producing plausible piano covers.*
This model was contributed by [Susnato Dhar](https://huggingface.co/susnato).
The original code can be found [here](https://github.com/sweetcocoa/pop2piano).
Tips:
## Usage tips
1. To use Pop2Piano, you will need to install the 🤗 Transformers library, as well as the following third party modules:
* To use Pop2Piano, you will need to install the 🤗 Transformers library, as well as the following third party modules:
```
pip install pretty-midi==0.2.9 essentia==2.1b6.dev1034 librosa scipy
```
Please note that you may need to restart your runtime after installation.
2. Pop2Piano is an Encoder-Decoder based model like T5.
* Pop2Piano is an Encoder-Decoder based model like T5.
3. Pop2Piano can be used to generate midi-audio files for a given audio sequence.
* Pop2Piano can be used to generate midi-audio files for a given audio sequence.
4. Choosing different composers in `Pop2PianoForConditionalGeneration.generate()` can lead to variety of different results.
* Choosing different composers in `Pop2PianoForConditionalGeneration.generate()` can lead to a variety of different results.
5. Setting the sampling rate to 44.1 kHz when loading the audio file can give good performance.
* Setting the sampling rate to 44.1 kHz when loading the audio file can give good performance.
6. Though Pop2Piano was mainly trained on Korean Pop music, it also does pretty well on other Western Pop or Hip Hop songs.
* Though Pop2Piano was mainly trained on Korean Pop music, it also does pretty well on other Western Pop or Hip Hop songs.
This model was contributed by [Susnato Dhar](https://huggingface.co/susnato).
The original code can be found [here](https://github.com/sweetcocoa/pop2piano).
## Examples
...
...@@ -25,10 +25,6 @@ rendered properly in your Markdown viewer.
</a>
</div>
**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
@patrickvonplaten
## Overview
The ProphetNet model was proposed in [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training,](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei
...@@ -49,15 +45,15 @@ dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Giga ...@@ -49,15 +45,15 @@ dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Giga
abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.* state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.*
Tips: The Authors' code can be found [here](https://github.com/microsoft/ProphetNet).
## Usage tips
- ProphetNet is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the left.
- The model architecture is based on the original Transformer, but replaces the “standard” self-attention mechanism in the decoder by a main self-attention mechanism and a self and n-stream (predict) self-attention mechanism.
The Authors' code can be found [here](https://github.com/microsoft/ProphetNet).
## Resources
## Documentation resources
- [Causal language modeling task guide](../tasks/language_modeling)
- [Translation task guide](../tasks/translation)
...
...@@ -32,22 +32,18 @@ by processors with high-throughput integer math pipelines. We also present a wor ...@@ -32,22 +32,18 @@ by processors with high-throughput integer math pipelines. We also present a wor
able to maintain accuracy within 1% of the floating-point baseline on all networks studied, including models that are able to maintain accuracy within 1% of the floating-point baseline on all networks studied, including models that are
more difficult to quantize, such as MobileNets and BERT-large.* more difficult to quantize, such as MobileNets and BERT-large.*
Tips: This model was contributed by [shangz](https://huggingface.co/shangz).
## Usage tips
- QDQBERT model adds fake quantization operations (pair of QuantizeLinear/DequantizeLinear ops) to (i) linear layer
inputs and weights, (ii) matmul inputs, (iii) residual add inputs, in the BERT model (a quantizer-setup sketch follows these tips).
- QDQBERT requires the [Pytorch Quantization Toolkit](https://github.com/NVIDIA/TensorRT/tree/master/tools/pytorch-quantization). To install it: `pip install pytorch-quantization --extra-index-url https://pypi.ngc.nvidia.com`
- QDQBERT model can be loaded from any checkpoint of HuggingFace BERT model (for example *bert-base-uncased*), and
perform Quantization Aware Training/Post Training Quantization.
- A complete example of using QDQBERT model to perform Quantization Aware Training and Post Training Quantization for
SQUAD task can be found at [transformers/examples/research_projects/quantization-qdqbert/](examples/research_projects/quantization-qdqbert/).
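The quantizer-setup sketch referenced in the first tip, roughly along the lines of the *Set default quantizers* section below (the 8-bit max/per-channel descriptors here are illustrative defaults, not a tuned recipe):

```python
import pytorch_quantization.nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

# per-tensor max calibration for activations, per-channel (axis 0) quantization for weights
input_desc = QuantDescriptor(num_bits=8, calib_method="max")
weight_desc = QuantDescriptor(num_bits=8, axis=(0,))
quant_nn.QuantLinear.set_default_quant_desc_input(input_desc)
quant_nn.QuantLinear.set_default_quant_desc_weight(weight_desc)
```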
This model was contributed by [shangz](https://huggingface.co/shangz).
### Set default quantizers
QDQBERT model adds fake quantization operations (pair of QuantizeLinear/DequantizeLinear ops) to BERT by
...@@ -118,7 +114,7 @@ the instructions in [torch.onnx](https://pytorch.org/docs/stable/onnx.html). Exa
>>> torch.onnx.export(...)
```
## Documentation resources
## Resources
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
...
...@@ -52,8 +52,12 @@ parametric-only seq2seq baseline.*
This model was contributed by [ola13](https://huggingface.co/ola13).
Tips:
## Usage tips
- Retrieval-augmented generation (“RAG”) models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq models. RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs. The retriever and seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks.
Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq models.
RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs. The retriever and seq2seq
modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation to adapt
to downstream tasks.
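A minimal retrieve-then-generate sketch; `use_dummy_dataset=True` keeps the index download small and is only meant for smoke-testing, not for real retrieval quality:

```python
from transformers import RagRetriever, RagSequenceForGeneration, RagTokenizer

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

inputs = tokenizer("who holds the record in 100m freestyle", return_tensors="pt")
# the retriever fetches passages internally; generation marginalizes over them
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```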
## RagConfig
...@@ -73,6 +77,9 @@ Tips:
[[autodoc]] RagRetriever
<frameworkcontent>
<pt>
## RagModel
[[autodoc]] RagModel
...@@ -90,6 +97,9 @@ Tips:
- forward
- generate
</pt>
<tf>
## TFRagModel
[[autodoc]] TFRagModel
...@@ -106,3 +116,6 @@ Tips:
[[autodoc]] TFRagTokenForGeneration
- call
- generate
</tf>
</frameworkcontent>
...@@ -25,8 +25,6 @@ rendered properly in your Markdown viewer.
</a>
</div>
**DISCLAIMER:** This model is still a work in progress, if you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).
## Overview
The Reformer model was proposed in the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451.pdf) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
...@@ -44,7 +42,7 @@ while being much more memory-efficient and much faster on long sequences.*
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The Authors' code can be
found [here](https://github.com/google/trax/tree/master/trax/models/reformer).
Tips:
## Usage tips
- Reformer does **not** work with *torch.nn.DataParallel* due to a bug in PyTorch, see [issue #36035](https://github.com/pytorch/pytorch/issues/36035).
- Use Axial position encoding (see below for more details). It’s a mechanism to avoid having a huge positional encoding matrix (when the sequence length is very big) by factorizing it into smaller matrices.
...@@ -52,7 +50,7 @@ Tips:
- Avoid storing the intermediate results of each layer by using reversible transformer layers to obtain them during the backward pass (subtracting the residuals from the input of the next layer gives them back) or recomputing them for results inside a given layer (less efficient than storing them but saves memory).
- Compute the feedforward operations by chunks and not on the whole batch.
## Axial Positional Encodings ### Axial Positional Encodings
Axial Positional Encodings were first implemented in Google's [trax library](https://github.com/google/trax/blob/4d99ad4965bab1deba227539758d59f0df0fef48/trax/layers/research/position_encodings.py#L29)
and developed by the authors of this model's paper. In models that are treating very long input sequences, the
...@@ -96,7 +94,7 @@ product has to be equal to `config.max_embedding_size`, which during training ha
length* of the `input_ids`.
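The factorization is controlled through the configuration. Below is a minimal sketch, assuming the `ReformerConfig` attribute names `axial_pos_embds`, `axial_pos_shape` and `axial_pos_embds_dim`; see the configuration reference for the exact names and defaults.

```python
from transformers import ReformerConfig, ReformerModel

# Factorize a 4096-position encoding matrix into a 64 x 64 grid.
# The product of axial_pos_shape must equal the training sequence length,
# and the entries of axial_pos_embds_dim must sum to hidden_size.
config = ReformerConfig(
    axial_pos_embds=True,
    axial_pos_shape=[64, 64],       # 64 * 64 = 4096 positions
    axial_pos_embds_dim=[64, 192],  # 64 + 192 = 256 = hidden_size
    hidden_size=256,
)
model = ReformerModel(config)
```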
## LSH Self Attention ### LSH Self Attention
In Locality sensitive hashing (LSH) self attention the key and query projection weights are tied. Therefore, the key
query embedding vectors are also tied. LSH self attention uses the locality sensitive hashing mechanism proposed in
...@@ -129,7 +127,7 @@ Using LSH self attention, the memory and time complexity of the query-key matmul
and time bottleneck in a transformer model, with \\(n_s\\) being the sequence length.
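A minimal sketch of a model that uses LSH self attention in every layer, assuming the `ReformerConfig` attributes `attn_layers`, `lsh_attn_chunk_length` and `num_hashes`:

```python
from transformers import ReformerConfig, ReformerModel

# Use LSH self attention in every layer; key and query projections are shared.
config = ReformerConfig(
    attn_layers=["lsh", "lsh", "lsh", "lsh"],
    lsh_attn_chunk_length=64,  # chunk size the sorted query/key vectors are split into
    num_hashes=4,              # more hashing rounds are more accurate but slower
)
model = ReformerModel(config)
```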
## Local Self Attention ### Local Self Attention
Local self attention is essentially a "normal" self attention layer with key, query and value projections, but is
chunked so that in each chunk of length `config.local_chunk_length` the query embedding vectors only attend to
...@@ -141,7 +139,7 @@ Using Local self attention, the memory and time complexity of the query-key matm
and time bottleneck in a transformer model, with \\(n_s\\) being the sequence length.
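Analogously, a minimal sketch of a configuration that uses local self attention in every layer, assuming the `ReformerConfig` attributes `local_attn_chunk_length`, `local_num_chunks_before` and `local_num_chunks_after`:

```python
from transformers import ReformerConfig, ReformerModel

# Use chunked "local" self attention in every layer.
config = ReformerConfig(
    attn_layers=["local", "local", "local", "local"],
    local_attn_chunk_length=64,  # queries attend within their own chunk...
    local_num_chunks_before=1,   # ...plus this many preceding chunks
    local_num_chunks_after=0,
)
model = ReformerModel(config)
```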
## Training ### Training
During training, we must ensure that the sequence length is set to a value that can be divided by the least common
multiple of `config.lsh_chunk_length` and `config.local_chunk_length` and that the parameters of the Axial
...@@ -155,7 +153,7 @@ input_ids = tokenizer.encode("This is a sentence from the training data", return
loss = model(input_ids, labels=input_ids)[0]
```
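The sketch below shows one way to compute a valid padded length, assuming the chunk lengths are exposed on the configuration as `lsh_attn_chunk_length` and `local_attn_chunk_length` (the prose above refers to them as `config.lsh_chunk_length` and `config.local_chunk_length`):

```python
import math

from transformers import ReformerConfig

config = ReformerConfig()  # defaults; your chunk lengths may differ

# A valid training sequence length is a multiple of the least common multiple
# of the two chunk lengths (and should match the product of config.axial_pos_shape).
lcm = math.lcm(config.lsh_attn_chunk_length, config.local_attn_chunk_length)
seq_len = 1000
padded_len = math.ceil(seq_len / lcm) * lcm
print(f"pad {seq_len} tokens up to {padded_len}")
```

When tokenizing, passing `pad_to_multiple_of` together with padding to the tokenizer usually achieves the same effect.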
## Documentation resources ## Resources
- [Text classification task guide](../tasks/sequence_classification)
- [Question answering task guide](../tasks/question_answering)
...
...@@ -26,15 +26,13 @@ The abstract from the paper is the following:
*In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, we design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level. Using our methodology we explore the structure aspect of network design and arrive at a low-dimensional design space consisting of simple, regular networks that we call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes. Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.*
Tips:
- One can use [`AutoImageProcessor`] to prepare images for the model.
- The huge 10B model from [Self-supervised Pretraining of Visual Features in the Wild](https://arxiv.org/abs/2103.01988), trained on one billion Instagram images, is available on the [hub](https://huggingface.co/facebook/regnet-y-10b-seer)
This model was contributed by [Francesco](https://huggingface.co/Francesco). The TensorFlow version of the model
was contributed by [sayakpaul](https://huggingface.com/sayakpaul) and [ariG23498](https://huggingface.com/ariG23498).
The original code can be found [here](https://github.com/facebookresearch/pycls).
The huge 10B model from [Self-supervised Pretraining of Visual Features in the Wild](https://arxiv.org/abs/2103.01988),
trained on one billion Instagram images, is available on the [hub](https://huggingface.co/facebook/regnet-y-10b-seer)
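Below is a minimal image-classification sketch; the `facebook/regnet-y-040` checkpoint and the example image are placeholders, and any RegNet classification checkpoint from the hub works the same way.

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, RegNetForImageClassification

# Any RGB image works; this COCO image is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("facebook/regnet-y-040")
model = RegNetForImageClassification.from_pretrained("facebook/regnet-y-040")

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```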
## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with RegNet.
...@@ -50,37 +48,43 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] RegNetConfig
<frameworkcontent>
<pt>
## RegNetModel
[[autodoc]] RegNetModel
- forward
## RegNetForImageClassification
[[autodoc]] RegNetForImageClassification
- forward
</pt>
<tf>
## TFRegNetModel
[[autodoc]] TFRegNetModel
- call
## TFRegNetForImageClassification
[[autodoc]] TFRegNetForImageClassification
- call
</tf>
<jax>
## FlaxRegNetModel
[[autodoc]] FlaxRegNetModel
- __call__
## FlaxRegNetForImageClassification
[[autodoc]] FlaxRegNetForImageClassification
- __call__
\ No newline at end of file
</jax>
</frameworkcontent>
...@@ -34,14 +34,14 @@ Transformer representations to be more general and more transferable to other ta
findings, we are able to train models that achieve strong performance on the XTREME benchmark without increasing the
number of parameters at the fine-tuning stage.*
Tips: ## Usage tips
For fine-tuning, RemBERT can be thought of as a bigger version of mBERT with an ALBERT-like factorization of the
embedding layer. The embeddings are not tied in pre-training, in contrast with BERT, which enables smaller input
embeddings (preserved during fine-tuning) and bigger output embeddings (discarded at fine-tuning). The tokenizer is
also similar to the Albert one rather than the BERT one.
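As a rough illustration of this factorization, the sketch below inspects the default configuration; it assumes the `RemBertConfig` attributes `input_embedding_size` and `output_embedding_size`, so check the configuration reference below for the exact names and defaults.

```python
from transformers import RemBertConfig

config = RemBertConfig()
# Input embeddings are smaller than the hidden size, while the output embeddings
# used during pre-training are larger and are dropped for fine-tuning.
print(config.input_embedding_size, config.hidden_size, config.output_embedding_size)
```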
## Documentation resources ## Resources
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
...@@ -70,6 +70,9 @@ also similar to the Albert one rather than the BERT one.
- create_token_type_ids_from_sequences
- save_vocabulary
<frameworkcontent>
<pt>
## RemBertModel
[[autodoc]] RemBertModel
...@@ -105,6 +108,9 @@ also similar to the Albert one rather than the BERT one.
[[autodoc]] RemBertForQuestionAnswering
- forward
</pt>
<tf>
## TFRemBertModel
[[autodoc]] TFRemBertModel
...@@ -139,3 +145,6 @@ also similar to the Albert one rather than the BERT one.
[[autodoc]] TFRemBertForQuestionAnswering
- call
</tf>
</frameworkcontent>
...@@ -27,10 +27,6 @@ The abstract from the paper is the following:
*Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.*
Tips:
- One can use [`AutoImageProcessor`] to prepare images for the model.
The figure below illustrates the architecture of ResNet. Taken from the [original paper](https://arxiv.org/abs/1512.03385).
<img width="600" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/resnet_architecture.png"/>
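The sketch below extracts the final convolutional feature map with a plain `ResNetModel`; the `microsoft/resnet-50` checkpoint and the example image are placeholders.

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, ResNetModel

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("microsoft/resnet-50")
model = ResNetModel.from_pretrained("microsoft/resnet-50")

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Final feature map, shape (batch_size, num_channels, height, width).
print(outputs.last_hidden_state.shape)
```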
...@@ -52,30 +48,35 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] ResNetConfig
<frameworkcontent>
<pt>
## ResNetModel
[[autodoc]] ResNetModel
- forward
## ResNetForImageClassification
[[autodoc]] ResNetForImageClassification
- forward
</pt>
<tf>
## TFResNetModel
[[autodoc]] TFResNetModel
- call
## TFResNetForImageClassification
[[autodoc]] TFResNetForImageClassification
- call
</tf>
<jax>
## FlaxResNetModel
[[autodoc]] FlaxResNetModel
...@@ -85,3 +86,6 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] FlaxResNetForImageClassification
- __call__
</jax>
</frameworkcontent>