"docs/vscode:/vscode.git/clone" did not exist on "2acedf47214d7a634c193846124832a4686cc8fd"
Unverified commit 5964f820, authored by Maria Khalusova and committed by GitHub

[Docs] Model_doc structure/clarity improvements (#26876)

* first batch of structure improvements for model_docs

* second batch of structure improvements for model_docs

* more structure improvements for model_docs

* more structure improvements for model_docs

* structure improvements for cv model_docs

* more structural refactoring

* addressed feedback about image processors
Parent: ad8ff962
````diff
@@ -38,7 +38,7 @@ model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b").half().cud
 GPT-NeoX-20B also has a different tokenizer from the one used in GPT-J-6B and GPT-Neo. The new tokenizer allocates
 additional tokens to whitespace characters, making the model more suitable for certain tasks like code generation.
-### Generation
+## Usage example
 The `generate()` method can be used to generate text using GPT Neo model.
@@ -61,7 +61,7 @@ The `generate()` method can be used to generate text using GPT Neo model.
 >>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
 ```
-## Documentation resources
+## Resources
 - [Causal language modeling task guide](../tasks/language_modeling)
````
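The `generate()` example is only shown in part in the hunk above; a minimal sketch of the full flow it refers to (the prompt and sampling settings are illustrative, not taken from the diff):

```python
from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast

model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")
tokenizer = GPTNeoXTokenizerFast.from_pretrained("EleutherAI/gpt-neox-20b")

prompt = "GPTNeoX20B is a 20B-parameter autoregressive Transformer model developed by EleutherAI."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# sample a continuation, then decode it as in the truncated snippet above
gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
```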
````diff
@@ -25,7 +25,7 @@ Following the recommendations from Google's research on [PaLM](https://ai.google
 Development of the model was led by [Shinya Otani](https://github.com/SO0529), [Takayoshi Makabe](https://github.com/spider-man-tm), [Anuj Arora](https://github.com/Anuj040), and [Kyo Hattori](https://github.com/go5paopao) from [ABEJA, Inc.](https://www.abejainc.com/). For more information on this model-building activity, please refer [here (ja)](https://tech-blog.abeja.asia/entry/abeja-gpt-project-202207).
-### Generation
+### Usage example
 The `generate()` method can be used to generate text using GPT NeoX Japanese model.
@@ -51,7 +51,7 @@ The `generate()` method can be used to generate text using GPT NeoX Japanese mod
 人とAIが協調するためにはAIと人が共存しAIを正しく理解する必要があります
 ```
-## Documentation resources
+## Resources
 - [Causal language modeling task guide](../tasks/language_modeling)
````
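For context, Japanese output like the line shown above is typically produced along these lines; a sketch, with the checkpoint name and prompt as assumptions rather than content of the diff:

```python
from transformers import GPTNeoXJapaneseForCausalLM, GPTNeoXJapaneseTokenizer

model = GPTNeoXJapaneseForCausalLM.from_pretrained("abeja/gpt-neox-japanese-2.7b")
tokenizer = GPTNeoXJapaneseTokenizer.from_pretrained("abeja/gpt-neox-japanese-2.7b")

prompt = "人とAIが協調するためには、"  # "For people and AI to cooperate, ..."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100)
print(tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)[0])
```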
````diff
@@ -23,7 +23,7 @@ causal language model trained on [the Pile](https://pile.eleuther.ai/) dataset.
 This model was contributed by [Stella Biderman](https://huggingface.co/stellaathena).
-Tips:
+## Usage tips
 - To load [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) in float32 one would need at least 2x model size
 RAM: 1x for initial weights and another 1x to load the checkpoint. So for GPT-J it would take at least 48GB
@@ -56,7 +56,7 @@ Tips:
 size, the tokenizer for [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) contains 143 extra tokens
 `<|extratoken_1|>... <|extratoken_143|>`, so the `vocab_size` of tokenizer also becomes 50400.
-### Generation
+## Usage examples
 The [`~generation.GenerationMixin.generate`] method can be used to generate text using GPT-J
 model.
````
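To make the memory tip above concrete, a sketch of loading GPT-J in half precision; the `float16` revision is the branch published on the model card, so verify it for your checkpoint:

```python
import torch
from transformers import GPTJForCausalLM

# loading the fp16 weights roughly halves the RAM/VRAM footprint compared to float32
model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
```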
````diff
@@ -138,6 +138,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] GPTJConfig
 - all
+<frameworkcontent>
+<pt>
 ## GPTJModel
 [[autodoc]] GPTJModel
@@ -158,6 +161,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] GPTJForQuestionAnswering
 - forward
+</pt>
+<tf>
 ## TFGPTJModel
 [[autodoc]] TFGPTJModel
@@ -178,6 +184,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] TFGPTJForQuestionAnswering
 - call
+</tf>
+<jax>
 ## FlaxGPTJModel
 [[autodoc]] FlaxGPTJModel
@@ -187,3 +196,5 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] FlaxGPTJForCausalLM
 - __call__
+</jax>
+</frameworkcontent>
````
````diff
@@ -24,7 +24,7 @@ GPTSAN is a Japanese language model using Switch Transformer. It has the same st
 in the T5 paper, and support both Text Generation and Masked Language Modeling tasks. These basic tasks similarly can
 fine-tune for translation or summarization.
-### Generation
+### Usage example
 The `generate()` method can be used to generate text using GPTSAN-Japanese model.
@@ -56,7 +56,7 @@ This length applies to the text entered in `prefix_text` for the tokenizer.
 The tokenizer returns the mask of the `Prefix` part of Prefix-LM as `token_type_ids`.
 The model treats the part where `token_type_ids` is 1 as a `Prefix` part, that is, the input can refer to both tokens before and after.
-Tips:
+## Usage tips
 Specifying the Prefix part is done with a mask passed to self-attention.
 When token_type_ids=None or all zero, it is equivalent to regular causal mask
````
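A sketch of how the `Prefix` mask described above comes out of the tokenizer; the `prefix_text` argument and the `Tanrei/GPTSAN-japanese` checkpoint follow the surrounding docs, and the input strings are illustrative:

```python
from transformers import GPTSanJapaneseTokenizer

tokenizer = GPTSanJapaneseTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")

# tokens coming from prefix_text are marked with token_type_ids == 1 (the bidirectional Prefix part),
# the remaining tokens get 0 and are handled with the regular causal mask
encoded = tokenizer("について教えてください。", prefix_text="織田信長")
print(encoded["input_ids"])
print(encoded["token_type_ids"])
```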
````diff
@@ -23,26 +23,24 @@ The abstract from the paper is the following:
 *The Transformer architecture has become a dominant choice in many domains, such as natural language processing and computer vision. Yet, it has not achieved competitive performance on popular leaderboards of graph-level prediction compared to mainstream GNN variants. Therefore, it remains a mystery how Transformers could perform well for graph representation learning. In this paper, we solve this mystery by presenting Graphormer, which is built upon the standard Transformer architecture, and could attain excellent results on a broad range of graph representation learning tasks, especially on the recent OGB Large-Scale Challenge. Our key insight to utilizing Transformer in the graph is the necessity of effectively encoding the structural information of a graph into the model. To this end, we propose several simple yet effective structural encoding methods to help Graphormer better model graph-structured data. Besides, we mathematically characterize the expressive power of Graphormer and exhibit that with our ways of encoding the structural information of graphs, many popular GNN variants could be covered as the special cases of Graphormer.*
-Tips:
+This model was contributed by [clefourrier](https://huggingface.co/clefourrier). The original code can be found [here](https://github.com/microsoft/Graphormer).
+## Usage tips
 This model will not work well on large graphs (more than 100 nodes/edges), as it will make the memory explode.
 You can reduce the batch size, increase your RAM, or decrease the `UNREACHABLE_NODE_DISTANCE` parameter in algos_graphormer.pyx, but it will be hard to go above 700 nodes/edges.
 This model does not use a tokenizer, but instead a special collator during training.
-This model was contributed by [clefourrier](https://huggingface.co/clefourrier). The original code can be found [here](https://github.com/microsoft/Graphormer).
 ## GraphormerConfig
 [[autodoc]] GraphormerConfig
 ## GraphormerModel
 [[autodoc]] GraphormerModel
 - forward
 ## GraphormerForGraphClassification
 [[autodoc]] GraphormerForGraphClassification
````
````diff
@@ -25,13 +25,13 @@ The abstract from the paper is the following:
 *Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. With end-to-end deep learning systems, grouping of image regions usually happens implicitly via top-down supervision from pixel-level recognition labels. Instead, in this paper, we propose to bring back the grouping mechanism into deep networks, which allows semantic segments to emerge automatically with only text supervision. We propose a hierarchical Grouping Vision Transformer (GroupViT), which goes beyond the regular grid structure representation and learns to group image regions into progressively larger arbitrary-shaped segments. We train GroupViT jointly with a text encoder on a large-scale image-text dataset via contrastive losses. With only text supervision and without any pixel-level annotations, GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner, i.e., without any further fine-tuning. It achieves a zero-shot accuracy of 52.3% mIoU on the PASCAL VOC 2012 and 22.4% mIoU on PASCAL Context datasets, and performs competitively to state-of-the-art transfer-learning methods requiring greater levels of supervision.*
-Tips:
-- You may specify `output_segmentation=True` in the forward of `GroupViTModel` to get the segmentation logits of input texts.
 This model was contributed by [xvjiarui](https://huggingface.co/xvjiarui). The TensorFlow version was contributed by [ariG23498](https://huggingface.co/ariG23498) with the help of [Yih-Dar SHIEH](https://huggingface.co/ydshieh), [Amy Roberts](https://huggingface.co/amyeroberts), and [Joao Gante](https://huggingface.co/joaogante).
 The original code can be found [here](https://github.com/NVlabs/GroupViT).
+## Usage tips
+- You may specify `output_segmentation=True` in the forward of `GroupViTModel` to get the segmentation logits of input texts.
 ## Resources
 A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with GroupViT.
````
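A sketch of the `output_segmentation=True` tip above; the checkpoint and image URL are common examples assumed here, not content of the diff:

```python
import requests
from PIL import Image
from transformers import AutoProcessor, GroupViTModel

processor = AutoProcessor.from_pretrained("nvidia/groupvit-gcc-yfcc")
model = GroupViTModel.from_pretrained("nvidia/groupvit-gcc-yfcc")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs, output_segmentation=True)

# per-text segmentation logits over the image, on top of the usual image-text similarity scores
segmentation_logits = outputs.segmentation_logits
```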
````diff
@@ -52,6 +52,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] GroupViTVisionConfig
+<frameworkcontent>
+<pt>
 ## GroupViTModel
 [[autodoc]] GroupViTModel
@@ -69,6 +72,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] GroupViTVisionModel
 - forward
+</pt>
+<tf>
 ## TFGroupViTModel
 [[autodoc]] TFGroupViTModel
@@ -84,4 +90,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 ## TFGroupViTVisionModel
 [[autodoc]] TFGroupViTVisionModel
 - call
\ No newline at end of file
+</tf>
+</frameworkcontent>
````
````diff
@@ -37,7 +37,11 @@ which has the best average performance and obtains the best results for three ou
 extensive evaluation, including several standard baselines and recently proposed, multilingual Transformer-based
 models.*
-Examples of use:
+This model was contributed by [rmroczkowski](https://huggingface.co/rmroczkowski). The original code can be found
+[here](https://github.com/allegro/HerBERT).
+## Usage example
 ```python
 >>> from transformers import HerbertTokenizer, RobertaModel
@@ -56,9 +60,12 @@ Examples of use:
 >>> model = AutoModel.from_pretrained("allegro/herbert-klej-cased-v1")
 ```
-This model was contributed by [rmroczkowski](https://huggingface.co/rmroczkowski). The original code can be found
-[here](https://github.com/allegro/HerBERT).
+<Tip>
+Herbert implementation is the same as `BERT` except for the tokenization method. Refer to [BERT documentation](bert)
+for API reference and examples.
+</Tip>
 ## HerbertTokenizer
````
````diff
@@ -36,15 +36,15 @@ state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-lig
 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER
 reduction on the more challenging dev-other and test-other evaluation subsets.*
-Tips:
+This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).
+# Usage tips
 - Hubert is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
 - Hubert model was fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded
 using [`Wav2Vec2CTCTokenizer`].
-This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).
-## Documentation resources
+## Resources
 - [Audio classification task guide](../tasks/audio_classification)
 - [Automatic speech recognition task guide](../tasks/asr)
````
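A sketch of the decoding flow the tips above describe, using a fine-tuned CTC checkpoint; the dummy dataset is only there to provide a 16 kHz float array, and any waveform at that sampling rate works:

```python
import torch
from datasets import load_dataset
from transformers import Wav2Vec2Processor, HubertForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")

# grab one raw waveform sampled at 16 kHz
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# CTC decoding: argmax over the vocabulary, then collapse repeats/blanks in the tokenizer
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
```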
````diff
@@ -53,6 +53,9 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv
 [[autodoc]] HubertConfig
+<frameworkcontent>
+<pt>
 ## HubertModel
 [[autodoc]] HubertModel
@@ -68,6 +71,9 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv
 [[autodoc]] HubertForSequenceClassification
 - forward
+</pt>
+<tf>
 ## TFHubertModel
 [[autodoc]] TFHubertModel
@@ -77,3 +83,6 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv
 [[autodoc]] TFHubertForCTC
 - call
+</tf>
+</frameworkcontent>
````
````diff
@@ -40,7 +40,7 @@ been open-sourced.*
 This model was contributed by [kssteven](https://huggingface.co/kssteven). The original code can be found [here](https://github.com/kssteven418/I-BERT).
-## Documentation resources
+## Resources
 - [Text classification task guide](../tasks/sequence_classification)
 - [Token classification task guide](../tasks/token_classification)
````
````diff
@@ -31,9 +31,9 @@ This model was contributed by [HuggingFaceM4](https://huggingface.co/HuggingFace
 <Tip warning={true}>
-Idefics modeling code in Transformers is for finetuning and inferencing the pre-trained Idefics models.
-To train a new Idefics model from scratch use the m4 codebase (a link will be provided once it's made public)
+IDEFICS modeling code in Transformers is for finetuning and inferencing the pre-trained IDEFICS models.
+To train a new IDEFICS model from scratch use the m4 codebase (a link will be provided once it's made public)
 </Tip>
````
````diff
@@ -40,7 +40,7 @@ alt="drawing" width="600"/>
 This model was contributed by [nielsr](https://huggingface.co/nielsr), based on [this issue](https://github.com/openai/image-gpt/issues/7). The original code can be found
 [here](https://github.com/openai/image-gpt).
-Tips:
+## Usage tips
 - ImageGPT is almost exactly the same as [GPT-2](gpt2), with the exception that a different activation
 function is used (namely "quick gelu"), and the layer normalization layers don't mean center the inputs. ImageGPT
@@ -92,7 +92,6 @@ If you're interested in submitting a resource to be included here, please feel f
 ## ImageGPTFeatureExtractor
 [[autodoc]] ImageGPTFeatureExtractor
 - __call__
 ## ImageGPTImageProcessor
@@ -103,17 +102,14 @@ If you're interested in submitting a resource to be included here, please feel f
 ## ImageGPTModel
 [[autodoc]] ImageGPTModel
 - forward
 ## ImageGPTForCausalImageModeling
 [[autodoc]] ImageGPTForCausalImageModeling
 - forward
 ## ImageGPTForImageClassification
 [[autodoc]] ImageGPTForImageClassification
 - forward
````
````diff
@@ -39,13 +39,11 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] InformerConfig
 ## InformerModel
 [[autodoc]] InformerModel
 - forward
 ## InformerForPrediction
 [[autodoc]] InformerForPrediction
````
````diff
@@ -21,10 +21,6 @@ The abstract from the paper is the following:
 *General-purpose language models that can solve various language-domain tasks have emerged driven by the pre-training and instruction-tuning pipeline. However, building general-purpose vision-language models is challenging due to the increased task discrepancy introduced by the additional visual input. Although vision-language pre-training has been widely studied, vision-language instruction tuning remains relatively less explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pre-trained BLIP-2 models. We gather a wide variety of 26 publicly available datasets, transform them into instruction tuning format and categorize them into two clusters for held-in instruction tuning and held-out zero-shot evaluation. Additionally, we introduce instruction-aware visual feature extraction, a crucial method that enables the model to extract informative features tailored to the given instruction. The resulting InstructBLIP models achieve state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA IMG). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models.*
-Tips:
-- InstructBLIP uses the same architecture as [BLIP-2](blip2) with a tiny but important difference: it also feeds the text prompt (instruction) to the Q-Former.
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/instructblip_architecture.jpg"
 alt="drawing" width="600"/>
@@ -33,6 +29,9 @@ alt="drawing" width="600"/>
 This model was contributed by [nielsr](https://huggingface.co/nielsr).
 The original code can be found [here](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip).
+## Usage tips
+InstructBLIP uses the same architecture as [BLIP-2](blip2) with a tiny but important difference: it also feeds the text prompt (instruction) to the Q-Former.
 ## InstructBlipConfig
````
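A sketch of how the instruction is passed alongside the image in practice; the checkpoint, image URL and prompt are illustrative assumptions:

```python
import requests
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "What is unusual about this image?"

# one processor call prepares the pixel values and tokenizes the instruction,
# which the model feeds both to the Q-Former and to the language model
inputs = processor(images=image, text=prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0].strip())
```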
````diff
@@ -32,7 +32,11 @@ The metadata such as *artist, genre and timing* are passed to each prior, in the
 ![JukeboxModel](https://gist.githubusercontent.com/ArthurZucker/92c1acaae62ebf1b6a951710bdd8b6af/raw/c9c517bf4eff61393f6c7dec9366ef02bdd059a3/jukebox.svg)
-Tips:
+This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ).
+The original code can be found [here](https://github.com/openai/jukebox).
+## Usage tips
 - This model only supports inference. This is for a few reasons, mostly because it requires a crazy amount of memory to train. Feel free to open a PR and add what's missing to have a full integration with the hugging face traineer!
 - This model is very slow, and takes 8h to generate a minute long audio using the 5b top prior on a V100 GPU. In order automaticallay handle the device on which the model should execute, use `accelerate`.
 - Contrary to the paper, the order of the priors goes from `0` to `1` as it felt more intuitive : we sample starting from `0`.
@@ -67,14 +71,12 @@ The original code can be found [here](https://github.com/openai/jukebox).
 - upsample
 - _sample
 ## JukeboxPrior
 [[autodoc]] JukeboxPrior
 - sample
 - forward
 ## JukeboxVQVAE
 [[autodoc]] JukeboxVQVAE
````
````diff
@@ -46,7 +46,7 @@ document-level pretraining. It achieves new state-of-the-art results in several
 understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification
 (from 93.07 to 94.42).*
-Tips:
+## Usage tips
 - In addition to *input_ids*, [`~transformers.LayoutLMModel.forward`] also expects the input `bbox`, which are
 the bounding boxes (i.e. 2D-positions) of the input tokens. These can be obtained using an external OCR engine such
@@ -123,6 +123,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] LayoutLMTokenizerFast
+<frameworkcontent>
+<pt>
 ## LayoutLMModel
 [[autodoc]] LayoutLMModel
@@ -143,6 +146,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] LayoutLMForQuestionAnswering
+</pt>
+<tf>
 ## TFLayoutLMModel
 [[autodoc]] TFLayoutLMModel
@@ -162,3 +168,8 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 ## TFLayoutLMForQuestionAnswering
 [[autodoc]] TFLayoutLMForQuestionAnswering
+</tf>
+</frameworkcontent>
````
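The `bbox` tip in the LayoutLM hunks above expects boxes on a 0-1000 scale; a minimal normalization helper as a sketch (the function name is ours, not part of the library):

```python
def normalize_bbox(bbox, width, height):
    # bbox = (x0, y0, x1, y1) in absolute pixel coordinates of the page image;
    # LayoutLM expects each coordinate scaled to the 0-1000 range
    return [
        int(1000 * (bbox[0] / width)),
        int(1000 * (bbox[1] / height)),
        int(1000 * (bbox[2] / width)),
        int(1000 * (bbox[3] / height)),
    ]
```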
````diff
@@ -56,7 +56,7 @@ python -m pip install torchvision tesseract
 ```
 (If you are developing for LayoutLMv2, note that passing the doctests also requires the installation of these packages.)
-Tips:
+## Usage tips
 - The main difference between LayoutLMv1 and LayoutLMv2 is that the latter incorporates visual embeddings during
 pre-training (while LayoutLMv1 only adds visual embeddings during fine-tuning).
````
````diff
@@ -26,16 +26,6 @@ The abstract from the paper is the following:
 *Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.*
-Tips:
-- In terms of data processing, LayoutLMv3 is identical to its predecessor [LayoutLMv2](layoutlmv2), except that:
-- images need to be resized and normalized with channels in regular RGB format. LayoutLMv2 on the other hand normalizes the images internally and expects the channels in BGR format.
-- text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece.
-Due to these differences in data preprocessing, one can use [`LayoutLMv3Processor`] which internally combines a [`LayoutLMv3ImageProcessor`] (for the image modality) and a [`LayoutLMv3Tokenizer`]/[`LayoutLMv3TokenizerFast`] (for the text modality) to prepare all data for the model.
-- Regarding usage of [`LayoutLMv3Processor`], we refer to the [usage guide](layoutlmv2#usage-layoutlmv2processor) of its predecessor.
-- Demo notebooks for LayoutLMv3 can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3).
-- Demo scripts can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/layoutlmv3).
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/layoutlmv3_architecture.png"
 alt="drawing" width="600"/>
@@ -43,6 +33,14 @@ alt="drawing" width="600"/>
 This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of this model was added by [chriskoo](https://huggingface.co/chriskoo), [tokec](https://huggingface.co/tokec), and [lre](https://huggingface.co/lre). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/layoutlmv3).
+## Usage tips
+- In terms of data processing, LayoutLMv3 is identical to its predecessor [LayoutLMv2](layoutlmv2), except that:
+- images need to be resized and normalized with channels in regular RGB format. LayoutLMv2 on the other hand normalizes the images internally and expects the channels in BGR format.
+- text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece.
+Due to these differences in data preprocessing, one can use [`LayoutLMv3Processor`] which internally combines a [`LayoutLMv3ImageProcessor`] (for the image modality) and a [`LayoutLMv3Tokenizer`]/[`LayoutLMv3TokenizerFast`] (for the text modality) to prepare all data for the model.
+- Regarding usage of [`LayoutLMv3Processor`], we refer to the [usage guide](layoutlmv2#usage-layoutlmv2processor) of its predecessor.
 ## Resources
 A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LayoutLMv3. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
````
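A sketch of the combined processor flow described in the tips above; the checkpoint and label count are illustrative, and by default the image processor also runs Tesseract OCR on the image to obtain words and boxes:

```python
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base")
model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=7)

image = Image.open("document.png").convert("RGB")  # any document image

# the processor resizes/normalizes the image (RGB channel order) and BPE-tokenizes the OCR'd words,
# returning input_ids, bbox and pixel_values in a single encoding
encoding = processor(image, return_tensors="pt")
outputs = model(**encoding)
logits = outputs.logits
```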
````diff
@@ -53,6 +51,9 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
 </Tip>
+- Demo notebooks for LayoutLMv3 can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3).
+- Demo scripts can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/layoutlmv3).
 <PipelineTag pipeline="text-classification"/>
 - [`LayoutLMv2ForSequenceClassification`] is supported by this [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb).
@@ -103,6 +104,9 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
 [[autodoc]] LayoutLMv3Processor
 - __call__
+<frameworkcontent>
+<pt>
 ## LayoutLMv3Model
 [[autodoc]] LayoutLMv3Model
@@ -123,6 +127,9 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
 [[autodoc]] LayoutLMv3ForQuestionAnswering
 - forward
+</pt>
+<tf>
 ## TFLayoutLMv3Model
 [[autodoc]] TFLayoutLMv3Model
@@ -142,3 +149,6 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
 [[autodoc]] TFLayoutLMv3ForQuestionAnswering
 - call
+</tf>
+</frameworkcontent>
````
````diff
@@ -33,6 +33,10 @@ introduce a multilingual form understanding benchmark dataset named XFUN, which
 for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA
 cross-lingual pre-trained models on the XFUN dataset.*
+This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm).
+## Usage tips and examples
 One can directly plug in the weights of LayoutXLM into a LayoutLMv2 model, like so:
 ```python
@@ -56,10 +60,10 @@ Similar to LayoutLMv2, you can use [`LayoutXLMProcessor`] (which internally appl
 [`LayoutXLMTokenizer`]/[`LayoutXLMTokenizerFast`] in sequence) to prepare all
 data for the model.
-As LayoutXLM's architecture is equivalent to that of LayoutLMv2, one can refer to [LayoutLMv2's documentation page](layoutlmv2) for all tips, code examples and notebooks.
-This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm).
+<Tip>
+As LayoutXLM's architecture is equivalent to that of LayoutLMv2, one can refer to [LayoutLMv2's documentation page](layoutlmv2) for all tips, code examples and notebooks.
+</Tip>
 ## LayoutXLMTokenizer
````
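A minimal sketch of the two usage points described above, plugging LayoutXLM weights into the LayoutLMv2 architecture and preparing inputs with the processor; the checkpoint name is an assumption taken from the usual Hub naming:

```python
from PIL import Image
from transformers import LayoutLMv2Model, LayoutXLMProcessor

# LayoutXLM checkpoints load directly into the LayoutLMv2 architecture
model = LayoutLMv2Model.from_pretrained("microsoft/layoutxlm-base")

# the processor chains the image processor and LayoutXLMTokenizerFast;
# by default it runs OCR on the image to obtain words and boxes
processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base")
image = Image.open("document.png").convert("RGB")
encoding = processor(image, return_tensors="pt")
```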
````diff
@@ -35,7 +35,7 @@ WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED),
 long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization
 dataset.*
-Tips:
+## Usage tips
 - [`LEDForConditionalGeneration`] is an extension of
 [`BartForConditionalGeneration`] exchanging the traditional *self-attention* layer with
@@ -52,15 +52,15 @@ Tips:
 errors. This can be done by executing `model.gradient_checkpointing_enable()`.
 Moreover, the `use_cache=False`
 flag can be used to disable the caching mechanism to save memory.
-- A notebook showing how to evaluate LED, can be accessed [here](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing).
-- A notebook showing how to fine-tune LED, can be accessed [here](https://colab.research.google.com/drive/12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing).
 - LED is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
 the left.
 This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).
-## Documentation resources
+## Resources
+- [A notebook showing how to evaluate LED](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing).
+- [A notebook showing how to fine-tune LED](https://colab.research.google.com/drive/12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing).
 - [Text classification task guide](../tasks/sequence_classification)
 - [Question answering task guide](../tasks/question_answering)
 - [Translation task guide](../tasks/translation)
````
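A sketch tying the tips above together: gradient checkpointing plus `use_cache=False` for a memory-friendly pass over a long input, with global attention on the first token (checkpoint, lengths and target text are illustrative):

```python
import torch
from transformers import LEDForConditionalGeneration, LEDTokenizer

tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

# trade compute for memory on long documents
model.gradient_checkpointing_enable()

article = "..."  # a long document, up to 16384 tokens
inputs = tokenizer(article, return_tensors="pt", max_length=4096, truncation=True)

# LED expects a global attention mask; global attention on the first token is the usual choice
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

labels = tokenizer("a short summary", return_tensors="pt").input_ids
outputs = model(
    **inputs,
    global_attention_mask=global_attention_mask,
    labels=labels,
    use_cache=False,  # disable caching to save memory while training
)
loss = outputs.loss
```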
````diff
@@ -100,6 +100,9 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv
 [[autodoc]] models.led.modeling_tf_led.TFLEDSeq2SeqLMOutput
+<frameworkcontent>
+<pt>
 ## LEDModel
 [[autodoc]] LEDModel
@@ -120,6 +123,9 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv
 [[autodoc]] LEDForQuestionAnswering
 - forward
+</pt>
+<tf>
 ## TFLEDModel
 [[autodoc]] TFLEDModel
@@ -129,3 +135,9 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv
 [[autodoc]] TFLEDForConditionalGeneration
 - call
+</tf>
+</frameworkcontent>
````
````diff
@@ -38,7 +38,9 @@ alt="drawing" width="600"/>
 <small> LeViT Architecture. Taken from the <a href="https://arxiv.org/abs/2104.01136">original paper</a>.</small>
-Tips:
+This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/facebookresearch/LeViT).
+## Usage tips
 - Compared to ViT, LeViT models use an additional distillation head to effectively learn from a teacher (which, in the LeViT paper, is a ResNet like-model). The distillation head is learned through backpropagation under supervision of a ResNet like-model. They also draw inspiration from convolution neural networks to use activation maps with decreasing resolutions to increase the efficiency.
 - There are 2 ways to fine-tune distilled models, either (1) in a classic way, by only placing a prediction head on top
@@ -63,8 +65,6 @@ Tips:
 - You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer)
 (you can just replace [`ViTFeatureExtractor`] by [`LevitImageProcessor`] and [`ViTForImageClassification`] by [`LevitForImageClassification`] or [`LevitForImageClassificationWithTeacher`]).
-This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/facebookresearch/LeViT).
 ## Resources
 A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LeViT.
@@ -90,7 +90,6 @@ If you're interested in submitting a resource to be included here, please feel f
 [[autodoc]] LevitImageProcessor
 - preprocess
 ## LevitModel
 [[autodoc]] LevitModel
````
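A sketch of the drop-in replacement mentioned in the last usage tip above; the checkpoint is one of the released LeViT sizes, assumed here as an example:

```python
import requests
from PIL import Image
from transformers import LevitImageProcessor, LevitForImageClassificationWithTeacher

processor = LevitImageProcessor.from_pretrained("facebook/levit-128S")
model = LevitForImageClassificationWithTeacher.from_pretrained("facebook/levit-128S")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits  # average of the classification and distillation head predictions
print(model.config.id2label[logits.argmax(-1).item()])
```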