Unverified Commit 5964f820 authored by Maria Khalusova, committed by GitHub

[Docs] Model_doc structure/clarity improvements (#26876)

* first batch of structure improvements for model_docs

* second batch of structure improvements for model_docs

* more structure improvements for model_docs

* more structure improvements for model_docs

* structure improvements for cv model_docs

* more structural refactoring

* addressed feedback about image processors
parent ad8ff962
@@ -20,11 +20,10 @@ rendered properly in your Markdown viewer.
CPM-Ant is an open-source Chinese pre-trained language model (PLM) with 10B parameters. It is also the first milestone of the live training process of CPM-Live. The training process is cost-effective and environment-friendly. CPM-Ant also achieves promising results with delta tuning on the CUGE benchmark. Besides the full model, we also provide various compressed versions to meet the requirements of different hardware configurations. [See more](https://github.com/OpenBMB/CPM-Live/tree/cpm-ant/cpm-live)
Tips:
This model was contributed by [OpenBMB](https://huggingface.co/openbmb). The original code can be found [here](https://github.com/OpenBMB/CPM-Live/tree/cpm-ant/cpm-live).
## Resources
- A tutorial on [CPM-Live](https://github.com/OpenBMB/CPM-Live/tree/cpm-ant/cpm-live).
## CpmAntConfig
...
@@ -41,7 +41,10 @@ providing more explicit control over text generation. These codes also allow CTR
training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data
via model-based source attribution.*
This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitishr). The original code can be found
[here](https://github.com/salesforce/ctrl).
## Usage tips
- CTRL makes use of control codes to generate text: it requires generations to be started by certain words, sentences
or links to generate coherent text. Refer to the [original implementation](https://github.com/salesforce/ctrl) for
@@ -56,10 +59,8 @@ Tips:
pre-computed values in the context of text generation. See the [`forward`](model_doc/ctrl#transformers.CTRLModel.forward)
method for more information on the usage of this argument.
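A minimal generation sketch along these lines, assuming the `Salesforce/ctrl` checkpoint on the Hub (the control code is simply prepended to the prompt):

```python
from transformers import CTRLTokenizer, CTRLLMHeadModel

# assuming the "Salesforce/ctrl" checkpoint; "Links" is one of the control codes from the paper
tokenizer = CTRLTokenizer.from_pretrained("Salesforce/ctrl")
model = CTRLLMHeadModel.from_pretrained("Salesforce/ctrl")

# generations should start with a control code, which steers the style of the output
inputs = tokenizer("Links Hello, my name is", return_tensors="pt")

# generate() caches key/value states internally, which is what `past_key_values` exposes
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```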
This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitishr). The original code can be found
[here](https://github.com/salesforce/ctrl).
## Resources
- [Text classification task guide](../tasks/sequence_classification)
- [Causal language modeling task guide](../tasks/language_modeling)
@@ -73,6 +74,9 @@ This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitis
[[autodoc]] CTRLTokenizer
- save_vocabulary
<frameworkcontent>
<pt>
## CTRLModel
[[autodoc]] CTRLModel
@@ -88,6 +92,9 @@ This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitis
[[autodoc]] CTRLForSequenceClassification
- forward
</pt>
<tf>
## TFCTRLModel
[[autodoc]] TFCTRLModel
@@ -102,3 +109,6 @@ This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitis
[[autodoc]] TFCTRLForSequenceClassification
- call
</tf>
</frameworkcontent>
@@ -33,15 +33,15 @@ performance gains are maintained when pretrained on larger datasets (\eg ImageNe
ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7\% on the ImageNet-1k val set. Finally, our results show that the positional encoding,
a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks.*
This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/microsoft/CvT).
## Usage tips
- CvT models are regular Vision Transformers, but trained with convolutions. They outperform the [original model (ViT)](vit) when fine-tuned on ImageNet-1K and CIFAR-100.
- You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace [`ViTFeatureExtractor`] by [`AutoImageProcessor`] and [`ViTForImageClassification`] by [`CvtForImageClassification`]; see the sketch after this list).
- The available checkpoints are either (1) pre-trained on [ImageNet-22k](http://www.image-net.org/) (a collection of 14 million images and 22k classes) only, (2) also fine-tuned on ImageNet-22k or (3) also fine-tuned on [ImageNet-1k](http://www.image-net.org/challenges/LSVRC/2012/) (also referred to as ILSVRC 2012, a collection of 1.3 million
images and 1,000 classes).
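A minimal image-classification sketch along those lines, assuming the `microsoft/cvt-13` checkpoint (fine-tuned on ImageNet-1k) and the usual COCO sample image:

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, CvtForImageClassification

# assuming the "microsoft/cvt-13" checkpoint
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
model = CvtForImageClassification.from_pretrained("microsoft/cvt-13")

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```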
This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/microsoft/CvT).
## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CvT.
@@ -57,6 +57,9 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] CvtConfig
<frameworkcontent>
<pt>
## CvtModel
[[autodoc]] CvtModel
@@ -67,6 +70,9 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] CvtForImageClassification
- forward
</pt>
<tf>
## TFCvtModel
[[autodoc]] TFCvtModel
@@ -77,3 +83,5 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] TFCvtForImageClassification
- call
</tf>
</frameworkcontent>
@@ -35,19 +35,18 @@ the entire input. Experiments on the major benchmarks of speech recognition, ima
natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches.
Models and code are available at www.github.com/pytorch/fairseq/tree/master/examples/data2vec.*
Tips:
- Data2VecAudio, Data2VecText, and Data2VecVision have all been trained using the same self-supervised learning method.
- For Data2VecAudio, preprocessing is identical to [`Wav2Vec2Model`], including feature extraction
- For Data2VecText, preprocessing is identical to [`RobertaModel`], including tokenization.
- For Data2VecVision, preprocessing is identical to [`BeitModel`], including feature extraction.
This model was contributed by [edugp](https://huggingface.co/edugp) and [patrickvonplaten](https://huggingface.co/patrickvonplaten).
[sayakpaul](https://github.com/sayakpaul) and [Rocketknight1](https://github.com/Rocketknight1) contributed Data2Vec for vision in TensorFlow.
The original code (for NLP and Speech) can be found [here](https://github.com/pytorch/fairseq/tree/main/examples/data2vec).
The original code for vision can be found [here](https://github.com/facebookresearch/data2vec_vision/tree/main/beit).
## Usage tips
- Data2VecAudio, Data2VecText, and Data2VecVision have all been trained using the same self-supervised learning method.
- For Data2VecAudio, preprocessing is identical to [`Wav2Vec2Model`], including feature extraction
- For Data2VecText, preprocessing is identical to [`RobertaModel`], including tokenization.
- For Data2VecVision, preprocessing is identical to [`BeitModel`], including feature extraction.
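As a small illustration of the text variant, a sketch assuming the `facebook/data2vec-text-base` checkpoint (the tokenizer is the RoBERTa one, in line with the tips above):

```python
import torch
from transformers import AutoTokenizer, Data2VecTextModel

# assuming the "facebook/data2vec-text-base" checkpoint; tokenization follows RoBERTa
tokenizer = AutoTokenizer.from_pretrained("facebook/data2vec-text-base")
model = Data2VecTextModel.from_pretrained("facebook/data2vec-text-base")

inputs = tokenizer("Hello, data2vec!", return_tensors="pt")
with torch.no_grad():
    last_hidden_state = model(**inputs).last_hidden_state
print(last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```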
## Resources
@@ -88,6 +87,8 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] Data2VecVisionConfig
<frameworkcontent>
<pt>
## Data2VecAudioModel
@@ -164,6 +165,9 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] Data2VecVisionForSemanticSegmentation
- forward
</pt>
<tf>
## TFData2VecVisionModel
[[autodoc]] TFData2VecVisionModel
@@ -178,3 +182,6 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] TFData2VecVisionForSemanticSegmentation
- call
</tf>
</frameworkcontent>
@@ -62,7 +62,7 @@ New in v2:
This model was contributed by [DeBERTa](https://huggingface.co/DeBERTa). The TF 2.0 implementation of this model was
contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/DeBERTa).
## Resources
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
@@ -88,6 +88,9 @@ contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code
- build_inputs_with_special_tokens
- create_token_type_ids_from_sequences
<frameworkcontent>
<pt>
## DebertaV2Model
[[autodoc]] DebertaV2Model
@@ -123,6 +126,9 @@ contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code
[[autodoc]] DebertaV2ForMultipleChoice
- forward
</pt>
<tf>
## TFDebertaV2Model
[[autodoc]] TFDebertaV2Model
@@ -157,3 +163,6 @@ contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code
[[autodoc]] TFDebertaV2ForMultipleChoice
- call
</tf>
</frameworkcontent>
@@ -94,6 +94,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- build_inputs_with_special_tokens
- create_token_type_ids_from_sequences
<frameworkcontent>
<pt>
## DebertaModel
[[autodoc]] DebertaModel
@@ -123,6 +126,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] DebertaForQuestionAnswering
- forward
</pt>
<tf>
## TFDebertaModel
[[autodoc]] TFDebertaModel
@@ -152,3 +158,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] TFDebertaForQuestionAnswering
- call
</tf>
</frameworkcontent>
@@ -33,9 +33,7 @@ This allows us to draw upon the simplicity and scalability of the Transformer ar
Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on
Atari, OpenAI Gym, and Key-to-Door tasks.*
This version of the model is for tasks where the state is a vector.
This version of the model is for tasks where the state is a vector, image-based states will come soon.
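As a rough sketch of what vector-valued inputs look like, assuming the `edbeeching/decision-transformer-gym-hopper-medium` checkpoint (random tensors stand in for a real environment):

```python
import torch
from transformers import DecisionTransformerModel

# assuming the "edbeeching/decision-transformer-gym-hopper-medium" checkpoint
model = DecisionTransformerModel.from_pretrained("edbeeching/decision-transformer-gym-hopper-medium")

batch_size, seq_len = 1, 20
states = torch.randn(batch_size, seq_len, model.config.state_dim)  # vector-valued states
actions = torch.zeros(batch_size, seq_len, model.config.act_dim)
returns_to_go = torch.ones(batch_size, seq_len, 1)
timesteps = torch.arange(seq_len).unsqueeze(0)
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)

with torch.no_grad():
    outputs = model(
        states=states,
        actions=actions,
        returns_to_go=returns_to_go,
        timesteps=timesteps,
        attention_mask=attention_mask,
    )
print(outputs.action_preds.shape)  # (batch_size, seq_len, act_dim)
```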
This model was contributed by [edbeeching](https://huggingface.co/edbeeching). The original code can be found [here](https://github.com/kzl/decision-transformer).
...
@@ -25,11 +25,6 @@ The abstract from the paper is the following:
*DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we proposed Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10 times less training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach.*
Tips:
- One can use [`DeformableDetrImageProcessor`] to prepare images (and optional targets) for the model.
- Training Deformable DETR is equivalent to training the original [DETR](detr) model. See the [resources](#resources) section below for demo notebooks.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/deformable_detr_architecture.png" <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/deformable_detr_architecture.png"
alt="drawing" width="600"/> alt="drawing" width="600"/>
...@@ -37,6 +32,10 @@ alt="drawing" width="600"/> ...@@ -37,6 +32,10 @@ alt="drawing" width="600"/>
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/fundamentalvision/Deformable-DETR). This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/fundamentalvision/Deformable-DETR).
## Usage tips
- Training Deformable DETR is equivalent to training the original [DETR](detr) model. See the [resources](#resources) section below for demo notebooks.
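A minimal inference sketch, assuming the `SenseTime/deformable-detr` checkpoint and the usual COCO sample image:

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, DeformableDetrForObjectDetection

# assuming the "SenseTime/deformable-detr" checkpoint
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("SenseTime/deformable-detr")
model = DeformableDetrForObjectDetection.from_pretrained("SenseTime/deformable-detr")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# rescale boxes to the original image size and keep confident detections
results = processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=torch.tensor([image.size[::-1]])
)[0]
for score, label in zip(results["scores"], results["labels"]):
    print(model.config.id2label[label.item()], round(score.item(), 3))
```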
## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Deformable DETR.
...
@@ -16,13 +16,6 @@ rendered properly in your Markdown viewer.
# DeiT
<Tip>
This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs or slight
breaking changes to fix it in the future. If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).
</Tip>
## Overview
The DeiT model was proposed in [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre
@@ -45,7 +38,9 @@ distillation, especially when using a convnet as a teacher. This leads us to rep
for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and
models.*
This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of this model was added by [amyeroberts](https://huggingface.co/amyeroberts).
## Usage tips
- Compared to ViT, DeiT models use a so-called distillation token to effectively learn from a teacher (which, in the
DeiT paper, is a ResNet-like model). The distillation token is learned through backpropagation, by interacting with
@@ -73,8 +68,6 @@ Tips:
*facebook/deit-base-patch16-384*. Note that one should use [`DeiTImageProcessor`] in order to
prepare images for the model.
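For instance, a minimal classification sketch with the distilled variant, assuming the `facebook/deit-base-distilled-patch16-224` checkpoint and the usual COCO sample image:

```python
import torch
import requests
from PIL import Image
from transformers import DeiTImageProcessor, DeiTForImageClassificationWithTeacher

# assuming the "facebook/deit-base-distilled-patch16-224" checkpoint
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = DeiTImageProcessor.from_pretrained("facebook/deit-base-distilled-patch16-224")
model = DeiTForImageClassificationWithTeacher.from_pretrained("facebook/deit-base-distilled-patch16-224")

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    # the logits average the class-token and distillation-token predictions
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```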
This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of this model was added by [amyeroberts](https://huggingface.co/amyeroberts).
## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DeiT.
@@ -104,6 +97,9 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] DeiTImageProcessor
- preprocess
<frameworkcontent>
<pt>
## DeiTModel
[[autodoc]] DeiTModel
@@ -124,6 +120,9 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] DeiTForImageClassificationWithTeacher
- forward
</pt>
<tf>
## TFDeiTModel
[[autodoc]] TFDeiTModel
@@ -143,3 +142,6 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] TFDeiTForImageClassificationWithTeacher
- call
</tf>
</frameworkcontent>
\ No newline at end of file
@@ -24,12 +24,10 @@ The abstract of the paper states the following:
*Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art (SOTA) models require at least tens of thousands of training examples and their reasoning capabilities are still much limited, especially on complex human-written queries. This paper presents the first one-shot solution to visual language reasoning. We decompose the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The key in this method is a modality conversion module, named as DePlot, which translates the image of a plot or chart to a linearized table. The output of DePlot can then be directly used to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs. To obtain DePlot, we standardize the plot-to-table task by establishing unified task formats and metrics, and train DePlot end-to-end on this task. DePlot can then be used off-the-shelf together with LLMs in a plug-and-play fashion. Compared with a SOTA model finetuned on more than >28k data points, DePlot+LLM with just one-shot prompting achieves a 24.0% improvement over finetuned SOTA on human-written queries from the task of chart QA.*
## Model description
DePlot is a model that is trained using `Pix2Struct` architecture. You can find more information about `Pix2Struct` in the [Pix2Struct documentation](https://huggingface.co/docs/transformers/main/en/model_doc/pix2struct).
DePlot is a Visual Question Answering subset of `Pix2Struct` architecture. It renders the input question on the image and predicts the answer.
## Usage example
Currently one checkpoint is available for DePlot:
@@ -59,4 +57,10 @@ from transformers.optimization import Adafactor, get_cosine_schedule_with_warmup
optimizer = Adafactor(self.parameters(), scale_parameter=False, relative_step=False, lr=0.01, weight_decay=1e-05)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=40000)
```
\ No newline at end of file
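For inference, a rough plot-to-table sketch, assuming the `google/deplot` checkpoint and a local chart image:

```python
from PIL import Image
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration

# assuming the "google/deplot" checkpoint
processor = Pix2StructProcessor.from_pretrained("google/deplot")
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

image = Image.open("chart.png").convert("RGB")  # replace with your own plot or chart image
inputs = processor(
    images=image,
    text="Generate underlying data table of the figure below:",
    return_tensors="pt",
)
predictions = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(predictions[0], skip_special_tokens=True))
```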
<Tip>
DePlot is a model trained using `Pix2Struct` architecture. For API reference, see [`Pix2Struct` documentation](pix2struct).
</Tip>
\ No newline at end of file
@@ -26,10 +26,6 @@ The abstract from the paper is the following:
*Detection Transformer (DETR) directly transforms queries to unique objects by using one-to-one bipartite matching during training and enables end-to-end object detection. Recently, these models have surpassed traditional detectors on COCO with undeniable elegance. However, they differ from traditional detectors in multiple designs, including model architecture and training schedules, and thus the effectiveness of one-to-one matching is not fully understood. In this work, we conduct a strict comparison between the one-to-one Hungarian matching in DETRs and the one-to-many label assignments in traditional detectors with non-maximum supervision (NMS). Surprisingly, we observe one-to-many assignments with NMS consistently outperform standard one-to-one matching under the same setting, with a significant gain of up to 2.5 mAP. Our detector that trains Deformable-DETR with traditional IoU-based label assignment achieved 50.2 COCO mAP within 12 epochs (1x schedule) with ResNet50 backbone, outperforming all existing traditional or transformer-based detectors in this setting. On multiple datasets, schedules, and architectures, we consistently show bipartite matching is unnecessary for performant detection transformers. Furthermore, we attribute the success of detection transformers to their expressive transformer architecture.*
Tips:
- One can use [`DetaImageProcessor`] to prepare images and optional targets for the model.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/deta_architecture.jpg" <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/deta_architecture.jpg"
alt="drawing" width="600"/> alt="drawing" width="600"/>
...@@ -51,20 +47,17 @@ If you're interested in submitting a resource to be included here, please feel f ...@@ -51,20 +47,17 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] DetaConfig [[autodoc]] DetaConfig
## DetaImageProcessor ## DetaImageProcessor
[[autodoc]] DetaImageProcessor [[autodoc]] DetaImageProcessor
- preprocess - preprocess
- post_process_object_detection - post_process_object_detection
## DetaModel ## DetaModel
[[autodoc]] DetaModel [[autodoc]] DetaModel
- forward - forward
## DetaForObjectDetection ## DetaForObjectDetection
[[autodoc]] DetaForObjectDetection [[autodoc]] DetaForObjectDetection
......
@@ -41,6 +41,8 @@ baselines.*
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/facebookresearch/detr).
## How DETR works
Here's a TLDR explaining how [`~transformers.DetrForObjectDetection`] works:
First, an image is sent through a pre-trained convolutional backbone (in the paper, the authors use
@@ -79,7 +81,7 @@ where one first trains a [`~transformers.DetrForObjectDetection`] model to detec
the mask head for 25 epochs. Experimentally, these two approaches give similar results. Note that predicting boxes is
required for the training to be possible, since the Hungarian matching is computed using distances between boxes.
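In code, that end-to-end flow looks roughly like this, assuming the `facebook/detr-resnet-50` checkpoint and the usual COCO sample image:

```python
import torch
import requests
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

# assuming the "facebook/detr-resnet-50" checkpoint
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# keep detections above a confidence threshold and rescale boxes to the original image size
results = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=torch.tensor([image.size[::-1]])
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), [round(c, 1) for c in box.tolist()])
```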
## Usage tips
- DETR uses so-called **object queries** to detect objects in an image. The number of queries determines the maximum
number of objects that can be detected in a single image, and is set to 100 by default (see parameter
@@ -165,14 +167,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
## DETR specific outputs
[[autodoc]] models.detr.modeling_detr.DetrModelOutput
[[autodoc]] models.detr.modeling_detr.DetrObjectDetectionOutput
[[autodoc]] models.detr.modeling_detr.DetrSegmentationOutput
## DetrConfig
[[autodoc]] DetrConfig
@@ -195,6 +189,14 @@ If you're interested in submitting a resource to be included here, please feel f
- post_process_instance_segmentation
- post_process_panoptic_segmentation
## DETR specific outputs
[[autodoc]] models.detr.modeling_detr.DetrModelOutput
[[autodoc]] models.detr.modeling_detr.DetrObjectDetectionOutput
[[autodoc]] models.detr.modeling_detr.DetrSegmentationOutput
## DetrModel
[[autodoc]] DetrModel
...
@@ -32,7 +32,9 @@ that leverage DialoGPT generate more relevant, contentful and context-consistent
systems. The pre-trained model and training pipeline are publicly released to facilitate research into neural response
generation and the development of more intelligent open-domain dialogue systems.*
The original code can be found [here](https://github.com/microsoft/DialoGPT).
## Usage tips
- DialoGPT is a model with absolute position embeddings, so it's usually advised to pad the inputs on the right rather
than the left.
@@ -47,7 +49,8 @@ follow the OpenAI GPT-2 to model a multiturn dialogue session as a long text and
modeling. We first concatenate all dialog turns within a dialogue session into a long text x_1,..., x_N (N is the
sequence length), ended by the end-of-text token.* For more information please refer to the original paper.
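A single chat turn following that format could look like this, assuming the `microsoft/DialoGPT-medium` checkpoint:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# assuming the "microsoft/DialoGPT-medium" checkpoint
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

# a dialogue turn is the user text followed by the end-of-text token
input_ids = tokenizer.encode("Does money buy happiness?" + tokenizer.eos_token, return_tensors="pt")
output_ids = model.generate(input_ids, max_length=200, pad_token_id=tokenizer.eos_token_id)

# the reply is everything generated after the prompt
print(tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))
```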
<Tip>
DialoGPT's architecture is based on the GPT2 model; refer to [GPT2's documentation page](gpt2) for API reference and examples.
</Tip>
@@ -44,17 +44,6 @@ and ADE20K (48.5 PQ), and instance segmentation model on Cityscapes (44.5 AP) an
It also matches the state of the art specialized semantic segmentation models on ADE20K (58.2 mIoU),
and ranks second on Cityscapes (84.5 mIoU) (no extra data). *
Tips:
- One can use the [`AutoImageProcessor`] API to prepare images for the model.
- DiNAT can be used as a *backbone*. When `output_hidden_states = True`,
it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, height, width, num_channels)`.
Notes:
- DiNAT depends on [NATTEN](https://github.com/SHI-Labs/NATTEN/)'s implementation of Neighborhood Attention and Dilated Neighborhood Attention.
You can install it with pre-built wheels for Linux by referring to [shi-labs.com/natten](https://shi-labs.com/natten), or build on your system by running `pip install natten`.
Note that the latter will likely take time to compile. NATTEN does not support Windows devices yet.
- Patch size of 4 is only supported at the moment.
<img
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/dilated-neighborhood-attention-pattern.jpg"
alt="drawing" width="600"/>
@@ -65,6 +54,17 @@ Taken from the <a href="https://arxiv.org/abs/2209.15001">original paper</a>.</s
This model was contributed by [Ali Hassani](https://huggingface.co/alihassanijr).
The original code can be found [here](https://github.com/SHI-Labs/Neighborhood-Attention-Transformer).
## Usage tips
DiNAT can be used as a *backbone*. When `output_hidden_states = True`,
it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, height, width, num_channels)`.
Notes:
- DiNAT depends on [NATTEN](https://github.com/SHI-Labs/NATTEN/)'s implementation of Neighborhood Attention and Dilated Neighborhood Attention.
You can install it with pre-built wheels for Linux by referring to [shi-labs.com/natten](https://shi-labs.com/natten), or build on your system by running `pip install natten`.
Note that the latter will likely take time to compile. NATTEN does not support Windows devices yet.
- Only a patch size of 4 is supported at the moment.
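Building on the backbone note above, a minimal sketch assuming the `shi-labs/dinat-mini-in1k-224` checkpoint and an installed `natten`:

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, DinatModel

# assuming the "shi-labs/dinat-mini-in1k-224" checkpoint
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("shi-labs/dinat-mini-in1k-224")
model = DinatModel.from_pretrained("shi-labs/dinat-mini-in1k-224")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# reshaped_hidden_states are (batch, num_channels, height, width), per the note above
print(outputs.hidden_states[-1].shape)
print(outputs.reshaped_hidden_states[-1].shape)
```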
## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DiNAT.
...
@@ -22,14 +22,9 @@ The abstract from the paper is the following:
*The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.*
Tips:
- One can use [`AutoImageProcessor`] class to prepare images for the model.
This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/facebookresearch/dinov2).
## Dinov2Config
[[autodoc]] Dinov2Config
...
@@ -51,7 +51,10 @@ distillation and cosine-distance losses. Our smaller, faster and lighter model i
demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device
study.*
This model was contributed by [victorsanh](https://huggingface.co/victorsanh). The JAX version of this model was
contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation).
## Usage tips
- DistilBERT doesn't have `token_type_ids`, so you don't need to indicate which token belongs to which segment. Just
separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`).
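For example, encoding a sentence pair shows that no `token_type_ids` are produced (a sketch assuming the `distilbert-base-uncased` checkpoint):

```python
import torch
from transformers import AutoTokenizer, DistilBertModel

# assuming the "distilbert-base-uncased" checkpoint
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

# the two segments are joined with [SEP]; note there is no "token_type_ids" entry
inputs = tokenizer("How are you?", "I am fine, thanks.", return_tensors="pt")
print(inputs.keys())

with torch.no_grad():
    last_hidden_state = model(**inputs).last_hidden_state
print(last_hidden_state.shape)
```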
@@ -63,8 +66,6 @@ Tips:
* predicting the masked tokens correctly (but no next-sentence objective)
* a cosine similarity between the hidden states of the student and the teacher model
This model was contributed by [victorsanh](https://huggingface.co/victorsanh). This model jax version was
contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation).
## Resources
@@ -144,6 +145,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] DistilBertTokenizerFast
<frameworkcontent>
<pt>
## DistilBertModel
[[autodoc]] DistilBertModel
@@ -174,6 +178,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] DistilBertForQuestionAnswering
- forward
</pt>
<tf>
## TFDistilBertModel
[[autodoc]] TFDistilBertModel
@@ -204,6 +211,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] TFDistilBertForQuestionAnswering
- call
</tf>
<jax>
## FlaxDistilBertModel
[[autodoc]] FlaxDistilBertModel
@@ -233,3 +243,10 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] FlaxDistilBertForQuestionAnswering
- __call__
</jax>
</frameworkcontent>
@@ -37,6 +37,10 @@ alt="drawing" width="600"/>
<small> Summary of the approach. Taken from the [original paper](https://arxiv.org/abs/2203.02378). </small>
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/dit).
## Usage tips
One can directly use the weights of DiT with the AutoModel API:
```python
@@ -66,10 +70,6 @@ model = AutoModelForImageClassification.from_pretrained("microsoft/dit-base-fine
This particular checkpoint was fine-tuned on [RVL-CDIP](https://www.cs.cmu.edu/~aharley/rvl-cdip/), an important benchmark for document image classification.
A notebook that illustrates inference for document image classification can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DiT/Inference_with_DiT_(Document_Image_Transformer)_for_document_image_classification.ipynb).
As DiT's architecture is equivalent to that of BEiT, one can refer to [BEiT's documentation page](beit) for all tips, code examples and notebooks.
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/dit).
## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DiT.
@@ -78,4 +78,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- [`BeitForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
\ No newline at end of file
<Tip>
As DiT's architecture is equivalent to that of BEiT, one can refer to [BEiT's documentation page](beit) for all tips, code examples and notebooks.
</Tip>
@@ -34,14 +34,14 @@ alt="drawing" width="600"/>
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found
[here](https://github.com/clovaai/donut).
## Usage tips
- The quickest way to get started with Donut is by checking the [tutorial
notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Donut), which show how to use the model
at inference time as well as fine-tuning on custom data.
- Donut is always used within the [VisionEncoderDecoder](vision-encoder-decoder) framework.
## Inference examples
Donut's [`VisionEncoderDecoder`] model accepts images as input and makes use of
[`~generation.GenerationMixin.generate`] to autoregressively generate text given the input image.
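A document visual question answering sketch along those lines, assuming the `naver-clova-ix/donut-base-finetuned-docvqa` checkpoint and a local document image (the task-prompt format is specific to that checkpoint):

```python
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# assuming the "naver-clova-ix/donut-base-finetuned-docvqa" checkpoint
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")

image = Image.open("document.png").convert("RGB")  # replace with your own document image
pixel_values = processor(image, return_tensors="pt").pixel_values

# the decoder is primed with a task prompt that contains the question
prompt = "<s_docvqa><s_question>What is the invoice number?</s_question><s_answer>"
decoder_input_ids = processor.tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids

outputs = model.generate(pixel_values, decoder_input_ids=decoder_input_ids, max_length=512)

# decode the raw sequence; processor.token2json(...) can turn it into structured output
print(processor.batch_decode(outputs)[0])
```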
...
@@ -43,7 +43,8 @@ benchmarks.*
This model was contributed by [lhoestq](https://huggingface.co/lhoestq). The original code can be found [here](https://github.com/facebookresearch/DPR).
## Usage tips
- DPR consists of three models:
* Question encoder: encode questions as vectors
@@ -86,6 +87,9 @@ Tips:
[[autodoc]] models.dpr.modeling_dpr.DPRReaderOutput
<frameworkcontent>
<pt>
## DPRContextEncoder
[[autodoc]] DPRContextEncoder
@@ -101,6 +105,9 @@ Tips:
[[autodoc]] DPRReader
- forward
</pt>
<tf>
## TFDPRContextEncoder
[[autodoc]] TFDPRContextEncoder
@@ -115,3 +122,7 @@ Tips:
[[autodoc]] TFDPRReader
- call
</tf>
</frameworkcontent>
@@ -56,6 +56,9 @@ The original code can be found [here](https://github.com/snap-research/Efficient
[[autodoc]] EfficientFormerImageProcessor
- preprocess
<frameworkcontent>
<pt>
## EfficientFormerModel
[[autodoc]] EfficientFormerModel
@@ -71,6 +74,9 @@ The original code can be found [here](https://github.com/snap-research/Efficient
[[autodoc]] EfficientFormerForImageClassificationWithTeacher
- forward
</pt>
<tf>
## TFEfficientFormerModel
[[autodoc]] TFEfficientFormerModel
@@ -85,3 +91,6 @@ The original code can be found [here](https://github.com/snap-research/Efficient
[[autodoc]] TFEfficientFormerForImageClassificationWithTeacher
- call
</tf>
</frameworkcontent>
\ No newline at end of file