Unverified Commit 207594be authored by Sylvain Gugger, committed by GitHub

Convert rst files (#14888)

* Convert all tutorials and guides

* Convert all remaining rst to mdx

* Track and fix bad links
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Speech2Text
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Speech2Text model was proposed in `fairseq S2T: Fast Speech-to-Text Modeling with fairseq
<https://arxiv.org/abs/2010.05171>`__ by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino. It's a
transformer-based seq2seq (encoder-decoder) model designed for end-to-end Automatic Speech Recognition (ASR) and Speech
Translation (ST). It uses a convolutional downsampler to reduce the length of speech inputs by 3/4th before they are
fed into the encoder. The model is trained with standard autoregressive cross-entropy loss and generates the
transcripts/translations autoregressively. Speech2Text has been fine-tuned on several datasets for ASR and ST:
`LibriSpeech <http://www.openslr.org/12>`__, `CoVoST 2 <https://github.com/facebookresearch/covost>`__, `MuST-C
<https://ict.fbk.eu/must-c/>`__.
This model was contributed by `valhalla <https://huggingface.co/valhalla>`__. The original code can be found `here
<https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text>`__.
Inference
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Speech2Text is a speech model that accepts a float tensor of log-mel filter-bank features extracted from the speech
signal. It's a transformer-based seq2seq model, so the transcripts/translations are generated autoregressively. The
:obj:`generate()` method can be used for inference.
The :class:`~transformers.Speech2TextFeatureExtractor` class is responsible for extracting the log-mel filter-bank
features. The :class:`~transformers.Speech2TextProcessor` wraps :class:`~transformers.Speech2TextFeatureExtractor` and
:class:`~transformers.Speech2TextTokenizer` into a single instance to both extract the input features and decode the
predicted token ids.
The feature extractor depends on :obj:`torchaudio` and the tokenizer depends on :obj:`sentencepiece` so be sure to
install those packages before running the examples. You can either install them as extra speech dependencies with
``pip install "transformers[speech, sentencepiece]"`` or install the packages separately with ``pip install torchaudio
sentencepiece``. Also, ``torchaudio`` requires the development version of the `libsndfile
<http://www.mega-nerd.com/libsndfile/>`__ package, which can be installed via a system package manager. On Ubuntu, it can
be installed as follows: ``apt install libsndfile1-dev``.
- ASR and Speech Translation
.. code-block::
>>> import torch
>>> from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
>>> from datasets import load_dataset
>>> import soundfile as sf
>>> model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
>>> processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")
>>> def map_to_array(batch):
... speech, _ = sf.read(batch["file"])
... batch["speech"] = speech
... return batch
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)
>>> inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
>>> generated_ids = model.generate(input_ids=inputs["input_features"], attention_mask=inputs["attention_mask"])
>>> transcription = processor.batch_decode(generated_ids)
- Multilingual speech translation
For multilingual speech translation models, :obj:`eos_token_id` is used as the :obj:`decoder_start_token_id` and
the target language id is forced as the first generated token. To force the target language id as the first
generated token, pass the :obj:`forced_bos_token_id` parameter to the :obj:`generate()` method. The following
example shows how to translate English speech to French text using the ``facebook/s2t-medium-mustc-multilingual-st``
checkpoint.
.. code-block::
>>> import torch
>>> from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
>>> from datasets import load_dataset
>>> import soundfile as sf
>>> model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")
>>> processor = Speech2TextProcessor.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")
>>> def map_to_array(batch):
... speech, _ = sf.read(batch["file"])
... batch["speech"] = speech
... return batch
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)
>>> inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
>>> generated_ids = model.generate(input_ids=inputs["input_features"], attention_mask=inputs["attention_mask], forced_bos_token_id=processor.tokenizer.lang_code_to_id["fr"])
>>> translation = processor.batch_decode(generated_ids)
See the `model hub <https://huggingface.co/models?filter=speech_to_text>`__ to look for Speech2Text checkpoints.
Speech2TextConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2TextConfig
:members:
Speech2TextTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2TextTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
Speech2TextFeatureExtractor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2TextFeatureExtractor
:members: __call__
Speech2TextProcessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2TextProcessor
:members: __call__, from_pretrained, save_pretrained, batch_decode, decode, as_target_processor
Speech2TextModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2TextModel
:members: forward
Speech2TextForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2TextForConditionalGeneration
:members: forward
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Speech2Text2
## Overview
The Speech2Text2 model is used together with [Wav2Vec2](wav2vec2) for Speech Translation models proposed in
[Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by
Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
Speech2Text2 is a *decoder-only* transformer model that can be used with any speech *encoder-only* model, such as
[Wav2Vec2](wav2vec2) or [HuBERT](hubert) for Speech-to-Text tasks. Please refer to the
[SpeechEncoderDecoder](speechencoderdecoder) class on how to combine Speech2Text2 with any speech *encoder-only*
model.
This model was contributed by [Patrick von Platen](https://huggingface.co/patrickvonplaten).
The original code can be found [here](https://github.com/pytorch/fairseq/blob/1f7ef9ed1e1061f8c7f88f8b94c7186834398690/fairseq/models/wav2vec/wav2vec2_asr.py#L266).
Tips:
- Speech2Text2 achieves state-of-the-art results on the CoVoST Speech Translation dataset. For more information, see
the [official models](https://huggingface.co/models?other=speech2text2).
- Speech2Text2 is always used within the [SpeechEncoderDecoder](speechencoderdecoder) framework.
- Speech2Text2's tokenizer is based on [fastBPE](https://github.com/glample/fastBPE).
## Inference
Speech2Text2's [`SpeechEncoderDecoderModel`] accepts raw waveform input values from speech and
makes use of [`~generation_utils.GenerationMixin.generate`] to translate the input speech
autoregressively to the target language.
The [`Wav2Vec2FeatureExtractor`] class is responsible for preprocessing the input speech and
[`Speech2Text2Tokenizer`] decodes the generated target tokens to the target string. The
[`Speech2Text2Processor`] wraps [`Wav2Vec2FeatureExtractor`] and
[`Speech2Text2Tokenizer`] into a single instance to both extract the input features and decode the
predicted token ids.
- Step-by-step Speech Translation
```python
>>> import torch
>>> from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel
>>> from datasets import load_dataset
>>> import soundfile as sf
>>> model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
>>> processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
>>> def map_to_array(batch):
... speech, _ = sf.read(batch["file"])
... batch["speech"] = speech
... return batch
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)
>>> inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
>>> generated_ids = model.generate(input_ids=inputs["input_values"], attention_mask=inputs["attention_mask"])
>>> transcription = processor.batch_decode(generated_ids)
```
- Speech Translation via Pipelines
The automatic speech recognition pipeline can also be used to translate speech in just a couple of lines of code:
```python
>>> from datasets import load_dataset
>>> from transformers import pipeline
>>> librispeech_en = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> asr = pipeline("automatic-speech-recognition", model="facebook/s2t-wav2vec2-large-en-de", feature_extractor="facebook/s2t-wav2vec2-large-en-de")
>>> translation_de = asr(librispeech_en[0]["file"])
```
See [model hub](https://huggingface.co/models?filter=speech2text2) to look for Speech2Text2 checkpoints.
## Speech2Text2Config
[[autodoc]] Speech2Text2Config
## Speech2Text2Tokenizer
[[autodoc]] Speech2Text2Tokenizer
- batch_decode
- decode
- save_vocabulary
## Speech2Text2Processor
[[autodoc]] Speech2Text2Processor
- __call__
- from_pretrained
- save_pretrained
- batch_decode
- decode
- as_target_processor
## Speech2Text2ForCausalLM
[[autodoc]] Speech2Text2ForCausalLM
- forward
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Speech2Text2
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Speech2Text2 model is used together with :doc:`Wav2Vec2 <wav2vec2>` for Speech Translation models proposed in
`Large-Scale Self- and Semi-Supervised Learning for Speech Translation <https://arxiv.org/abs/2104.06678>`__ by
Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
Speech2Text2 is a *decoder-only* transformer model that can be used with any speech *encoder-only* model, such as
:doc:`Wav2Vec2 <wav2vec2>` or :doc:`HuBERT <hubert>` for Speech-to-Text tasks. Please refer to the
:doc:`SpeechEncoderDecoder <speechencoderdecoder>` class on how to combine Speech2Text2 with any speech *encoder-only*
model.
This model was contributed by `Patrick von Platen <https://huggingface.co/patrickvonplaten>`__.
The original code can be found `here
<https://github.com/pytorch/fairseq/blob/1f7ef9ed1e1061f8c7f88f8b94c7186834398690/fairseq/models/wav2vec/wav2vec2_asr.py#L266>`__.
Tips:
- Speech2Text2 achieves state-of-the-art results on the CoVoST Speech Translation dataset. For more information, see
the `official models <https://huggingface.co/models?other=speech2text2>`__.
- Speech2Text2 is always used within the :doc:`SpeechEncoderDecoder <speechencoderdecoder>` framework.
- Speech2Text2's tokenizer is based on `fastBPE <https://github.com/glample/fastBPE>`__.
Inference
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Speech2Text2's :class:`~transformers.SpeechEncoderDecoderModel` accepts raw waveform input values from speech and
makes use of :func:`~transformers.generation_utils.GenerationMixin.generate` to translate the input speech
autoregressively to the target language.
The :class:`~transformers.Wav2Vec2FeatureExtractor` class is responsible for preprocessing the input speech and
:class:`~transformers.Speech2Text2Tokenizer` decodes the generated target tokens to the target string. The
:class:`~transformers.Speech2Text2Processor` wraps :class:`~transformers.Wav2Vec2FeatureExtractor` and
:class:`~transformers.Speech2Text2Tokenizer` into a single instance to both extract the input features and decode the
predicted token ids.
- Step-by-step Speech Translation
.. code-block::
>>> import torch
>>> from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel
>>> from datasets import load_dataset
>>> import soundfile as sf
>>> model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
>>> processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
>>> def map_to_array(batch):
... speech, _ = sf.read(batch["file"])
... batch["speech"] = speech
... return batch
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)
>>> inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
>>> generated_ids = model.generate(input_ids=inputs["input_values"], attention_mask=inputs["attention_mask"])
>>> transcription = processor.batch_decode(generated_ids)
- Speech Translation via Pipelines
The automatic speech recognition pipeline can also be used to translate speech in just a couple of lines of code:
.. code-block::
>>> from datasets import load_dataset
>>> from transformers import pipeline
>>> librispeech_en = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> asr = pipeline("automatic-speech-recognition", model="facebook/s2t-wav2vec2-large-en-de", feature_extractor="facebook/s2t-wav2vec2-large-en-de")
>>> translation_de = asr(librispeech_en[0]["file"])
See `model hub <https://huggingface.co/models?filter=speech2text2>`__ to look for Speech2Text2 checkpoints.
Speech2Text2Config
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2Text2Config
:members:
Speech2Text2Tokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2Text2Tokenizer
:members: batch_decode, decode, save_vocabulary
Speech2Text2Processor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2Text2Processor
:members: __call__, from_pretrained, save_pretrained, batch_decode, decode, as_target_processor
Speech2Text2ForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Speech2Text2ForCausalLM
:members: forward
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Speech Encoder Decoder Models
The [`SpeechEncoderDecoderModel`] can be used to initialize a speech-sequence-to-text-sequence model
with any pretrained speech autoencoding model as the encoder (*e.g.* [Wav2Vec2](wav2vec2), [Hubert](hubert)) and any pretrained autoregressive model as the decoder.
The effectiveness of initializing speech-sequence-to-text-sequence models with pretrained checkpoints for speech
recognition and speech translation has *e.g.* been shown in [Large-Scale Self- and Semi-Supervised Learning for Speech
Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli,
Alexis Conneau.
An example of how to use a [`SpeechEncoderDecoderModel`] for inference can be seen in
[Speech2Text2](speech_to_text_2).
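To make the initialization step concrete, here is a minimal sketch of warm-starting a speech encoder-decoder model from a pretrained speech encoder and a pretrained text decoder; the checkpoint names and output path are illustrative assumptions, not a recommendation:
```python
from transformers import SpeechEncoderDecoderModel

# warm-start from a pretrained speech encoder and a pretrained text decoder
# (illustrative checkpoints; any compatible pair can be used)
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/wav2vec2-base-960h", "bert-base-uncased"
)

# the decoder's cross-attention layers are newly initialized, so the combined model
# should be fine-tuned on a speech-to-text task before it is used for inference
model.save_pretrained("./wav2vec2-bert-speech-encoder-decoder")
```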
## SpeechEncoderDecoderConfig
[[autodoc]] SpeechEncoderDecoderConfig
## SpeechEncoderDecoderModel
[[autodoc]] SpeechEncoderDecoderModel
- forward
- from_encoder_decoder_pretrained
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Speech Encoder Decoder Models
-----------------------------------------------------------------------------------------------------------------------
The :class:`~transformers.SpeechEncoderDecoderModel` can be used to initialize a speech-sequence-to-text-sequence model
with any pretrained speech autoencoding model as the encoder (*e.g.* :doc:`Wav2Vec2 <wav2vec2>`, :doc:`Hubert
<hubert>`) and any pretrained autoregressive model as the decoder.
The effectiveness of initializing speech-sequence-to-text-sequence models with pretrained checkpoints for speech
recognition and speech translation has *e.g.* been shown in `Large-Scale Self- and Semi-Supervised Learning for Speech
Translation <https://arxiv.org/abs/2104.06678>`__ by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli,
Alexis Conneau.
An example of how to use a :class:`~transformers.SpeechEncoderDecoderModel` for inference can be seen in
:doc:`Speech2Text2 <speech_to_text_2>`.
SpeechEncoderDecoderConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.SpeechEncoderDecoderConfig
:members:
SpeechEncoderDecoderModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.SpeechEncoderDecoderModel
:members: forward, from_encoder_decoder_pretrained
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Splinter
## Overview
The Splinter model was proposed in [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy. Splinter
is an encoder-only transformer (similar to BERT) pretrained using the recurring span selection task on a large corpus
comprising Wikipedia and the Toronto Book Corpus.
@@ -37,51 +34,41 @@ Tips:
- Splinter was trained to predict answer spans conditioned on a special [QUESTION] token. These tokens contextualize
to question representations which are used to predict the answers. This layer is called QASS, and is the default
behaviour in the [`SplinterForQuestionAnswering`] class. Therefore:
- Use [`SplinterTokenizer`] (rather than [`BertTokenizer`]), as it already
contains this special token. Also, its default behavior is to use this token when two sequences are given (for
example, in the *run_qa.py* script).
- If you plan on using Splinter outside *run_qa.py*, please keep in mind the question token - it might be important for
the success of your model, especially in a few-shot setting.
- Please note there are two different checkpoints for each size of Splinter. Both are basically the same, except that
one also has the pretrained weights of the QASS layer (*tau/splinter-base-qass* and *tau/splinter-large-qass*) and one
doesn't (*tau/splinter-base* and *tau/splinter-large*). This is done to support randomly initializing this layer at
fine-tuning, as it is shown to yield better results for some cases in the paper.
This model was contributed by [yuvalkirstain](https://huggingface.co/yuvalkirstain) and [oriram](https://huggingface.co/oriram). The original code can be found [here](https://github.com/oriram/splinter).
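Under those caveats, a minimal extractive question answering sketch could look as follows; the question and context are made up for illustration, and *tau/splinter-base-qass* is the checkpoint that ships with the pretrained QASS head:
```python
import torch
from transformers import SplinterTokenizer, SplinterForQuestionAnswering

tokenizer = SplinterTokenizer.from_pretrained("tau/splinter-base-qass")
model = SplinterForQuestionAnswering.from_pretrained("tau/splinter-base-qass")

question = "Who wrote Pride and Prejudice?"
context = "Pride and Prejudice was written by Jane Austen and published in 1813."
# the tokenizer inserts the special question token when two sequences are given
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# greedily pick the most likely start/end positions and decode the answer span
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
answer = tokenizer.decode(inputs["input_ids"][0, start : end + 1])
```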
## SplinterConfig
[[autodoc]] SplinterConfig
## SplinterTokenizer
[[autodoc]] SplinterTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## SplinterTokenizerFast
[[autodoc]] SplinterTokenizerFast
## SplinterModel
[[autodoc]] SplinterModel
- forward
## SplinterForQuestionAnswering
[[autodoc]] SplinterForQuestionAnswering
- forward
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# SqueezeBERT
## Overview
The SqueezeBERT model was proposed in [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer. It's a
bidirectional transformer similar to the BERT model. The key difference between the BERT architecture and the
SqueezeBERT architecture is that SqueezeBERT uses [grouped convolutions](https://blog.yani.io/filter-group-tutorial)
instead of fully-connected layers for the Q, K, V and FFN layers.
The abstract from the paper is the following:
@@ -45,70 +42,47 @@ Tips:
efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained
with a causal language modeling (CLM) objective are better in that regard.
- For best results when finetuning on sequence classification tasks, it is recommended to start with the
*squeezebert/squeezebert-mnli-headless* checkpoint.
This model was contributed by [forresti](https://huggingface.co/forresti).
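As a hedged sketch of that recommendation (the label count and example sentence are illustrative assumptions), fine-tuning would start from the headless checkpoint with a freshly initialized classification head:
```python
import torch
from transformers import SqueezeBertTokenizer, SqueezeBertForSequenceClassification

# the headless checkpoint has no classification head, so a new one is randomly
# initialized here and must be fine-tuned before the logits are meaningful
tokenizer = SqueezeBertTokenizer.from_pretrained("squeezebert/squeezebert-mnli-headless")
model = SqueezeBertForSequenceClassification.from_pretrained(
    "squeezebert/squeezebert-mnli-headless", num_labels=2
)

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class_id = logits.argmax(dim=-1).item()
```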
## SqueezeBertConfig
[[autodoc]] SqueezeBertConfig
## SqueezeBertTokenizer
[[autodoc]] SqueezeBertTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## SqueezeBertTokenizerFast
[[autodoc]] SqueezeBertTokenizerFast
## SqueezeBertModel
[[autodoc]] SqueezeBertModel
## SqueezeBertForMaskedLM
[[autodoc]] SqueezeBertForMaskedLM
## SqueezeBertForSequenceClassification
[[autodoc]] SqueezeBertForSequenceClassification
## SqueezeBertForMultipleChoice
[[autodoc]] SqueezeBertForMultipleChoice
## SqueezeBertForTokenClassification
[[autodoc]] SqueezeBertForTokenClassification
## SqueezeBertForQuestionAnswering
[[autodoc]] SqueezeBertForQuestionAnswering
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# T5
## Overview
The T5 model was presented in [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf) by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
The abstract from the paper is the following:
@@ -41,76 +38,73 @@ Tips:
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
- See the [training](#training), [inference](#inference) and [scripts](#scripts) sections below for all details regarding usage.
T5 comes in different sizes:
- [t5-small](https://huggingface.co/t5-small)
- [t5-base](https://huggingface.co/t5-base)
- [t5-large](https://huggingface.co/t5-large)
- [t5-3b](https://huggingface.co/t5-3b)
- [t5-11b](https://huggingface.co/t5-11b).
Based on the original T5 model, Google has released some follow-up works:
- **T5v1.1**: T5v1.1 is an improved version of T5 with some architectural tweaks, and is pre-trained on C4 only without
mixing in the supervised tasks. Refer to the documentation of T5v1.1 which can be found [here](t5v1.1).
- **mT5**: mT5 is a multilingual T5 model. It is pre-trained on the mC4 corpus, which includes 101 languages. Refer to
the documentation of mT5 which can be found [here](mt5).
- **byT5**: byT5 is a T5 model pre-trained on byte sequences rather than SentencePiece subword token sequences. Refer
to the documentation of byT5 which can be found [here](byt5).
All checkpoints can be found on the [hub](https://huggingface.co/models?search=t5).
This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/google-research/text-to-text-transfer-transformer).
<a id='training'></a>
## Training
T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher
forcing. This means that for training, we always need an input sequence and a corresponding target sequence. The input
sequence is fed to the model using `input_ids`. The target sequence is shifted to the right, i.e., prepended by a
start-sequence token and fed to the decoder using the `decoder_input_ids`. In teacher-forcing style, the target
sequence is then appended by the EOS token and corresponds to the `labels`. The PAD token is hereby used as the
start-sequence token. T5 can be trained / fine-tuned both in a supervised and unsupervised fashion.
One can use [`T5ForConditionalGeneration`] (or the Tensorflow/Flax variant), which includes the
language modeling head on top of the decoder.
- Unsupervised denoising training
In this setup, spans of the input sequence are masked by so-called sentinel tokens (*a.k.a* unique mask tokens) and
the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens. Each
sentinel token represents a unique mask token for this sentence and should start with `<extra_id_0>`,
`<extra_id_1>`, ... up to `<extra_id_99>`. As a default, 100 sentinel tokens are available in
[`T5Tokenizer`].
For instance, the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be
processed as follows:
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt').input_ids
labels = tokenizer('<extra_id_0> cute dog <extra_id_1> the <extra_id_2>', return_tensors='pt').input_ids
# the forward function automatically creates the correct decoder_input_ids
loss = model(input_ids=input_ids, labels=labels).loss
```
If you're interested in pre-training T5 on a new corpus, check out the [run_t5_mlm_flax.py](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling) script in the Examples
directory.
- Supervised training
@@ -120,245 +114,229 @@ language modeling head on top of the decoder.
sequence "The house is wonderful." and output sequence "Das Haus ist wunderbar.", then they should be prepared for
the model as follows:
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer('translate English to German: The house is wonderful.', return_tensors='pt').input_ids
labels = tokenizer('Das Haus ist wunderbar.', return_tensors='pt').input_ids
# the forward function automatically creates the correct decoder_input_ids
loss = model(input_ids=input_ids, labels=labels).loss
```
As you can see, only 2 inputs are required for the model in order to compute a loss: `input_ids` (which are the
`input_ids` of the encoded input sequence) and `labels` (which are the `input_ids` of the encoded
target sequence). The model will automatically create the `decoder_input_ids` based on the `labels`, by
shifting them one position to the right and prepending the `config.decoder_start_token_id`, which for T5 is
equal to 0 (i.e. the id of the pad token). Also note the task prefix: we prepend the input sequence with 'translate
English to German: ' before encoding it. This will help in improving the performance, as this task prefix was used
during T5's pre-training.
However, the example above only shows a single training example. In practice, one trains deep learning models in
batches. This entails that we must pad/truncate examples to the same length. For encoder-decoder models, one
typically defines a `max_source_length` and `max_target_length`, which determine the maximum length of the
input and output sequences respectively (otherwise they are truncated). These should be carefully set depending on
the task.
In addition, we must make sure that padding token id's of the `labels` are not taken into account by the loss
function. In PyTorch and Tensorflow, this can be done by replacing them with -100, which is the `ignore_index`
of the `CrossEntropyLoss`. In Flax, one can use the `decoder_attention_mask` to ignore padded tokens from
the loss (see the [Flax summarization script](https://github.com/huggingface/transformers/tree/master/examples/flax/summarization) for details). We also pass
`attention_mask` as additional input to the model, which makes sure that padding tokens of the inputs are
ignored. The code example below illustrates all of this.
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# the following 2 hyperparameters are task-specific
max_source_length = 512
max_target_length = 128

# Suppose we have the following 2 training examples:
input_sequence_1 = "Welcome to NYC"
output_sequence_1 = "Bienvenue à NYC"

input_sequence_2 = "HuggingFace is a company"
output_sequence_2 = "HuggingFace est une entreprise"

# encode the inputs
task_prefix = "translate English to French: "
input_sequences = [input_sequence_1, input_sequence_2]
encoding = tokenizer([task_prefix + sequence for sequence in input_sequences],
                     padding='longest',
                     max_length=max_source_length,
                     truncation=True,
                     return_tensors="pt")
input_ids, attention_mask = encoding.input_ids, encoding.attention_mask

# encode the targets
target_encoding = tokenizer([output_sequence_1, output_sequence_2],
                            padding='longest',
                            max_length=max_target_length,
                            truncation=True)
labels = target_encoding.input_ids

# replace padding token id's of the labels by -100
labels = torch.tensor(labels)
labels[labels == tokenizer.pad_token_id] = -100

# forward pass
loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
```
Additional training tips:
- T5 models need a slightly higher learning rate than the default one set in the `Trainer` when using the AdamW
optimizer. Typically, 1e-4 and 3e-4 work well for most problems (classification, summarization, translation, question
answering, question generation). Note that T5 was pre-trained using the AdaFactor optimizer (a sketch follows this list).
- According to [this forum post](https://discuss.huggingface.co/t/t5-finetuning-tips/684), task prefixes matter when
(1) doing multi-task training (2) your task is similar or related to one of the supervised tasks used in T5's
pre-training mixture (see Appendix D of the [paper](https://arxiv.org/pdf/1910.10683.pdf) for the task prefixes
used).
- If training on TPU, it is recommended to pad all examples of the dataset to the same length or make use of
*pad_to_multiple_of* to have a small number of predefined bucket sizes to fit all examples in. Dynamically padding
batches to the longest example is not recommended on TPU, as it triggers a recompilation for every batch shape that is
encountered during training, thus significantly slowing down training; padding only up to the longest example in a
batch leads to very slow training on TPU.
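As a hedged sketch of the first tip, setting up Adafactor with a fixed learning rate could look like this; the 1e-3 value and the "t5-small" checkpoint are illustrative, not tuned recommendations:
```python
from transformers import T5ForConditionalGeneration
from transformers.optimization import Adafactor

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# fixed learning rate instead of Adafactor's default time-dependent schedule;
# pass this optimizer to your own training loop or to the Trainer via `optimizers`
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)
```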
<a id='inference'></a>
## Inference
At inference time, it is recommended to use [`~generation_utils.GenerationMixin.generate`]. This
method takes care of encoding the input and feeding the encoded hidden states via cross-attention layers to the decoder
and auto-regressively generates the decoder output. Check out [this blog post](https://huggingface.co/blog/how-to-generate) to know all the details about generating text with Transformers.
There's also [this blog post](https://huggingface.co/blog/encoder-decoder#encoder-decoder) which explains how
generation works in general in encoder-decoder models.
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer('translate English to German: The house is wonderful.', return_tensors='pt').input_ids
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Das Haus ist wunderbar.
```
Note that T5 uses the `pad_token_id` as the `decoder_start_token_id`, so when doing generation without using
[`~generation_utils.GenerationMixin.generate`], make sure you start it with the `pad_token_id`.
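For illustration, a single manual decoding step under that convention might look like the following sketch; greedy selection of the first generated token is shown only for clarity, and `generate()` performs this loop for you:
```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer('translate English to German: The house is wonderful.', return_tensors='pt').input_ids
# start the decoder with the pad token, which T5 uses as its decoder_start_token_id
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

with torch.no_grad():
    logits = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids).logits
next_token_id = logits[0, -1].argmax().item()  # greedy choice of the first generated token
```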
The example above only shows a single example. You can also do batched inference, like so: The example above only shows a single example. You can also do batched inference, like so:
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# when generating, we will use the logits of right-most token to predict the next token
# so the padding should be on the left
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token  # to avoid an error

task_prefix = 'translate English to German: '
sentences = ['The house is wonderful.', 'I like to work in NYC.']  # use different length sentences to test batching
inputs = tokenizer([task_prefix + sentence for sentence in sentences], return_tensors="pt", padding=True)

output_sequences = model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    do_sample=False,  # disable sampling to test if batching affects output
)

print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
# ['Das Haus ist wunderbar.', 'Ich arbeite gerne in NYC.']
```
<a id='scripts'></a>

## Example scripts
T5 is supported by several example scripts, both for pre-training and fine-tuning.

- pre-training: the [run_t5_mlm_flax.py](https://github.com/huggingface/transformers/blob/master/examples/flax/language-modeling/run_t5_mlm_flax.py)
  script allows you to further pre-train T5 or pre-train T5 from scratch on your own data. The [t5_tokenizer_model.py](https://github.com/huggingface/transformers/blob/master/examples/flax/language-modeling/t5_tokenizer_model.py)
  script allows you to further train a T5 tokenizer or train a T5 tokenizer from scratch on your own data. Note that
  Flax (a neural network library on top of JAX) is particularly useful to train on TPU hardware.
- fine-tuning: T5 is supported by the official summarization scripts ([PyTorch](https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization), [Tensorflow](https://github.com/huggingface/transformers/tree/master/examples/tensorflow/summarization), and [Flax](https://github.com/huggingface/transformers/tree/master/examples/flax/summarization)) and translation scripts
  ([PyTorch](https://github.com/huggingface/transformers/tree/master/examples/pytorch/translation) and [Tensorflow](https://github.com/huggingface/transformers/tree/master/examples/tensorflow/translation)). These scripts allow
  you to easily fine-tune T5 on custom data for summarization/translation.
## T5Config

[[autodoc]] T5Config

## T5Tokenizer

[[autodoc]] T5Tokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary

## T5TokenizerFast

[[autodoc]] T5TokenizerFast

## T5Model

[[autodoc]] T5Model
- forward
- parallelize
- deparallelize

## T5ForConditionalGeneration

[[autodoc]] T5ForConditionalGeneration
- forward
- parallelize
- deparallelize

## T5EncoderModel

[[autodoc]] T5EncoderModel
- forward
- parallelize
- deparallelize

## TFT5Model

[[autodoc]] TFT5Model
- call

## TFT5ForConditionalGeneration

[[autodoc]] TFT5ForConditionalGeneration
- call

## TFT5EncoderModel

[[autodoc]] TFT5EncoderModel
- call

## FlaxT5Model

[[autodoc]] FlaxT5Model
- __call__
- encode
- decode

## FlaxT5ForConditionalGeneration

[[autodoc]] FlaxT5ForConditionalGeneration
- __call__
- encode
- decode
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# T5v1.1

## Overview

T5v1.1 was released in the [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511)
repository by Colin Raffel et al. It's an improved version of the original T5 model.

One can directly plug in the weights of T5v1.1 into a T5 model, like so:

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained('google/t5-v1_1-base')
```
T5 Version 1.1 includes the following improvements compared to the original T5 model:

- GEGLU activation in the feed-forward hidden layer, rather than ReLU. See [this paper](https://arxiv.org/abs/2002.05202).
- Dropout was turned off in pre-training (quality win). Dropout should be re-enabled during fine-tuning.
- No parameter sharing between the embedding and classifier layer.
- "xl" and "xxl" replace "3B" and "11B". The model shapes are a bit different - larger `d_model` and smaller
  `num_heads` and `d_ff`.
Note: T5 Version 1.1 was only pre-trained on [C4](https://huggingface.co/datasets/c4), excluding any supervised
training. Therefore, this model has to be fine-tuned before it is usable on a downstream task, unlike the original T5
model. Since t5v1.1 was pre-trained in an unsupervised fashion, there's no real advantage to using a task prefix during single-task
fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix.
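As a quick way to see these changes reflected in code, one can inspect a v1.1 configuration. A minimal sketch, assuming the `feed_forward_proj` field is exposed by the installed version of Transformers:

```python
from transformers import T5Config

config = T5Config.from_pretrained('google/t5-v1_1-base')
# the gated-GELU feed-forward and the reshaped dimensions are visible directly in the config
print(config.feed_forward_proj)  # expected to be "gated-gelu" for v1.1 checkpoints
print(config.d_model, config.num_heads, config.d_ff)
```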
Google has released the following variants:

- [google/t5-v1_1-small](https://huggingface.co/google/t5-v1_1-small)
- [google/t5-v1_1-base](https://huggingface.co/google/t5-v1_1-base)
- [google/t5-v1_1-large](https://huggingface.co/google/t5-v1_1-large)
- [google/t5-v1_1-xl](https://huggingface.co/google/t5-v1_1-xl)
- [google/t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl)
One can refer to [T5's documentation page](t5) for all tips, code examples and notebooks.

This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be
found [here](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511).
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tapas_architecture.png"
alt="drawing" width="600"/>

<small> TAPAS architecture. Taken from the [official blog post](https://ai.googleblog.com/2020/04/using-neural-networks-to-find-answers.html). </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr). The Tensorflow version of this model was contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/tapas).
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Transformer XL

## Overview

The Transformer-XL model was proposed in [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan
Salakhutdinov. It's a causal (uni-directional) transformer with relative positioning (sinusoïdal) embeddings which can
reuse previously computed hidden-states to attend to longer context (memory). This model also uses adaptive softmax
inputs and outputs (tied).
  original implementation trains on SQuAD with padding on the left, therefore the padding defaults are set to left.
- Transformer-XL is one of the few models that has no sequence length limit.

This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/kimiyoung/transformer-xl).
<Tip warning={true}>

TransformerXL does **not** work with *torch.nn.DataParallel* due to a bug in PyTorch, see [issue #36035](https://github.com/pytorch/pytorch/issues/36035).

</Tip>
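To illustrate the memory mechanism mentioned above, here is a minimal sketch of carrying the cached hidden states (`mems`) across forward calls with the `transfo-xl-wt103` checkpoint; treat it as an illustration rather than a full training or generation recipe:

```python
import torch
from transformers import TransfoXLTokenizer, TransfoXLLMHeadModel

tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
model = TransfoXLLMHeadModel.from_pretrained("transfo-xl-wt103")

text = "The Transformer-XL architecture keeps a memory of previously processed segments ."
input_ids = tokenizer(text, return_tensors="pt")["input_ids"]

# process the text in two segments, carrying the memory ("mems") across calls
mems = None
for segment in torch.chunk(input_ids, 2, dim=-1):
    outputs = model(input_ids=segment, mems=mems)
    mems = outputs.mems  # cached hidden states, attended to by the next segment

print(len(mems), mems[0].shape)  # one memory tensor per layer
```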
## TransfoXLConfig

[[autodoc]] TransfoXLConfig

## TransfoXLTokenizer

[[autodoc]] TransfoXLTokenizer
- save_vocabulary

## TransfoXL specific outputs

[[autodoc]] models.transfo_xl.modeling_transfo_xl.TransfoXLModelOutput

[[autodoc]] models.transfo_xl.modeling_transfo_xl.TransfoXLLMHeadModelOutput

[[autodoc]] models.transfo_xl.modeling_tf_transfo_xl.TFTransfoXLModelOutput

[[autodoc]] models.transfo_xl.modeling_tf_transfo_xl.TFTransfoXLLMHeadModelOutput

## TransfoXLModel

[[autodoc]] TransfoXLModel
- forward

## TransfoXLLMHeadModel

[[autodoc]] TransfoXLLMHeadModel
- forward

## TransfoXLForSequenceClassification

[[autodoc]] TransfoXLForSequenceClassification
- forward

## TFTransfoXLModel

[[autodoc]] TFTransfoXLModel
- call

## TFTransfoXLLMHeadModel

[[autodoc]] TFTransfoXLLMHeadModel
- call

## TFTransfoXLForSequenceClassification

[[autodoc]] TFTransfoXLForSequenceClassification
- call

## Internal Layers

[[autodoc]] AdaptiveEmbedding

[[autodoc]] TFAdaptiveEmbedding
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/trocr_architecture.jpg"
alt="drawing" width="600"/>

<small> TrOCR architecture. Taken from the [original paper](https://arxiv.org/abs/2109.10282). </small>

Please refer to the [`VisionEncoderDecoder`] class on how to use this model.
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# UniSpeech

## Overview

The UniSpeech model was proposed in [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael
Zeng, Xuedong Huang.

The abstract from the paper is the following:
Tips:

- UniSpeech is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. Please
  use [`Wav2Vec2Processor`] for the feature extraction.
- UniSpeech model can be fine-tuned using connectionist temporal classification (CTC) so the model output has to be
  decoded using [`Wav2Vec2CTCTokenizer`] (see the sketch after this list).
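For illustration, ASR with a CTC-fine-tuned UniSpeech checkpoint follows the familiar Wav2Vec2-style pattern. The checkpoint name below is a placeholder, not an officially documented model id — substitute a CTC-fine-tuned UniSpeech model from the Hub:

```python
import torch
from datasets import load_dataset
from transformers import Wav2Vec2Processor, UniSpeechForCTC

# placeholder id - replace with an actual CTC-fine-tuned UniSpeech checkpoint from the Hub
checkpoint = "microsoft/unispeech-ctc-placeholder"
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = UniSpeechForCTC.from_pretrained(checkpoint)

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# greedy CTC decoding of the predicted token ids
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))
```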
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The Authors' code can be
found [here](https://github.com/microsoft/UniSpeech/tree/main/UniSpeech).
## UniSpeechConfig

[[autodoc]] UniSpeechConfig

## UniSpeech specific outputs

[[autodoc]] models.unispeech.modeling_unispeech.UniSpeechBaseModelOutput

[[autodoc]] models.unispeech.modeling_unispeech.UniSpeechForPreTrainingOutput

## UniSpeechModel

[[autodoc]] UniSpeechModel
- forward

## UniSpeechForCTC

[[autodoc]] UniSpeechForCTC
- forward

## UniSpeechForSequenceClassification

[[autodoc]] UniSpeechForSequenceClassification
- forward

## UniSpeechForPreTraining

[[autodoc]] UniSpeechForPreTraining
- forward
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# UniSpeech-SAT

## Overview

The UniSpeech-SAT model was proposed in [UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware
Pre-Training](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen,
Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.

The abstract from the paper is the following:
Tips:

- UniSpeechSat is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
  Please use [`Wav2Vec2Processor`] for the feature extraction.
- UniSpeechSat model can be fine-tuned using connectionist temporal classification (CTC) so the model output has to be
  decoded using [`Wav2Vec2CTCTokenizer`].
- UniSpeechSat performs especially well on speaker verification, speaker identification, and speaker diarization tasks
  (see the sketch after this list).
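As an illustration of the speaker-verification use case, the x-vector head returns one embedding per utterance that can be compared with cosine similarity. The checkpoint name below is a placeholder — substitute an x-vector UniSpeechSat model from the Hub:

```python
import torch
from datasets import load_dataset
from transformers import Wav2Vec2FeatureExtractor, UniSpeechSatForXVector

# placeholder id - replace with an actual x-vector UniSpeechSat checkpoint from the Hub
checkpoint = "microsoft/unispeech-sat-xvector-placeholder"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)
model = UniSpeechSatForXVector.from_pretrained(checkpoint)

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audios = [ds[0]["audio"]["array"], ds[1]["audio"]["array"]]
inputs = feature_extractor(audios, sampling_rate=16_000, padding=True, return_tensors="pt")

with torch.no_grad():
    embeddings = model(**inputs).embeddings

# cosine similarity between the two utterance-level speaker embeddings
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=-1)
print(float(similarity))
```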
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The Authors' code can be
found [here](https://github.com/microsoft/UniSpeech/tree/main/UniSpeech-SAT).
## UniSpeechSatConfig

[[autodoc]] UniSpeechSatConfig

## UniSpeechSat specific outputs

[[autodoc]] models.unispeech_sat.modeling_unispeech_sat.UniSpeechSatBaseModelOutput

[[autodoc]] models.unispeech_sat.modeling_unispeech_sat.UniSpeechSatForPreTrainingOutput

## UniSpeechSatModel

[[autodoc]] UniSpeechSatModel
- forward

## UniSpeechSatForCTC

[[autodoc]] UniSpeechSatForCTC
- forward

## UniSpeechSatForSequenceClassification

[[autodoc]] UniSpeechSatForSequenceClassification
- forward

## UniSpeechSatForAudioFrameClassification

[[autodoc]] UniSpeechSatForAudioFrameClassification
- forward

## UniSpeechSatForXVector

[[autodoc]] UniSpeechSatForXVector
- forward

## UniSpeechSatForPreTraining

[[autodoc]] UniSpeechSatForPreTraining
- forward
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# VisionTextDualEncoder
## Overview
The [`VisionTextDualEncoderModel`] can be used to initialize a vision-text dual encoder model with
any pretrained vision autoencoding model as the vision encoder (*e.g.* [ViT](vit), [BEiT](beit), [DeiT](deit)) and any pretrained text autoencoding model as the text encoder (*e.g.* [RoBERTa](roberta), [BERT](bert)). Two projection layers are added on top of both the vision and text encoder to project the output embeddings
to a shared latent space. The projection layers are randomly initialized so the model should be fine-tuned on a
downstream task. This model can be used to align the vision-text embeddings using CLIP-like contrastive image-text
training and then can be used for zero-shot vision tasks such as image classification or retrieval.

In [LiT: Zero-Shot Transfer with Locked-image Text Tuning](https://arxiv.org/abs/2111.07991) it is shown how
leveraging pre-trained (locked/frozen) image and text models for contrastive learning yields significant improvement on
new zero-shot vision tasks such as image classification or retrieval.
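A minimal sketch of assembling such a dual encoder from existing checkpoints (the ViT/BERT pairing and the COCO image URL are just illustrative choices):

```python
import requests
from PIL import Image
from transformers import (
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
    ViTFeatureExtractor,
    BertTokenizer,
)

# build the dual encoder from a pretrained vision and a pretrained text checkpoint
model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "google/vit-base-patch16-224", "bert-base-uncased"
)
processor = VisionTextDualEncoderProcessor(
    ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224"),
    BertTokenizer.from_pretrained("bert-base-uncased"),
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)
# CLIP-style image-text similarity scores; only meaningful after contrastive fine-tuning,
# since the projection layers start out randomly initialized
print(outputs.logits_per_image)
```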
## VisionTextDualEncoderConfig
[[autodoc]] VisionTextDualEncoderConfig
## VisionTextDualEncoderProcessor
[[autodoc]] VisionTextDualEncoderProcessor
## VisionTextDualEncoderModel
[[autodoc]] VisionTextDualEncoderModel
- forward
## FlaxVisionTextDualEncoderModel
[[autodoc]] FlaxVisionTextDualEncoderModel
- __call__
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Vision Encoder Decoder Models
The [`VisionEncoderDecoderModel`] can be used to initialize an image-to-text-sequence model with any
pretrained vision autoencoding model as the encoder (*e.g.* [ViT](vit), [BEiT](beit), [DeiT](deit))
and any pretrained language model as the decoder (*e.g.* [RoBERTa](roberta), [GPT2](gpt2), [BERT](bert)).
The effectiveness of initializing image-to-text-sequence models with pretrained checkpoints has been shown in (for
example) [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang,
Zhoujun Li, Furu Wei.
An example of how to use a [`VisionEncoderDecoderModel`] for inference can be seen in [TrOCR](trocr).
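A rough sketch of assembling such a model from separately pretrained checkpoints (the ViT/GPT-2 pairing here is only an example, not the TrOCR recipe), before fine-tuning it on an image-to-text task:

```python
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, GPT2Tokenizer

# initialize a ViT encoder with a GPT-2 decoder whose cross-attention layers are randomly initialized
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# generation/training needs these to be set explicitly for GPT-2
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.eos_token_id

# the combined model can be saved and later reloaded as a single checkpoint
model.save_pretrained("./vit-gpt2")
```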
## VisionEncoderDecoderConfig
[[autodoc]] VisionEncoderDecoderConfig
## VisionEncoderDecoderModel
[[autodoc]] VisionEncoderDecoderModel
- forward
- from_encoder_decoder_pretrained
## FlaxVisionEncoderDecoderModel
[[autodoc]] FlaxVisionEncoderDecoderModel
- __call__
- from_encoder_decoder_pretrained
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# VisualBERT

## Overview

The VisualBERT model was proposed in [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
VisualBERT is a neural network trained on a variety of (image, text) pairs.

The abstract from the paper is the following:
Tips:

1. Most of the checkpoints provided work with the [`VisualBertForPreTraining`] configuration. Other
   checkpoints provided are the fine-tuned checkpoints for down-stream tasks - VQA ('visualbert-vqa'), VCR
   ('visualbert-vcr'), NLVR2 ('visualbert-nlvr2'). Hence, if you are not working on these downstream tasks, it is
   recommended that you use the pretrained checkpoints.
We do not provide the detector and its weights as a part of the package, but it will be available in the research
projects, and the states can be loaded directly into the detector provided.
## Usage

VisualBERT is a multi-modal vision and language model. It can be used for visual question answering, multiple choice,
visual reasoning and region-to-phrase correspondence tasks. VisualBERT uses a BERT-like transformer to prepare
layer, and is expected to be bounded by [CLS] and [SEP] tokens, as in BERT. The segment IDs must also be set
appropriately for the textual and visual parts.

The [`BertTokenizer`] is used to encode the text. A custom detector/feature extractor must be used
to get the visual embeddings. The following example notebooks show how to use VisualBERT with Detectron-like models:
- [VisualBERT VQA demo notebook](https://github.com/huggingface/transformers/tree/master/examples/research_projects/visual_bert): This notebook
  contains an example on VisualBERT VQA.

- [Generate Embeddings for VisualBERT (Colab Notebook)](https://colab.research.google.com/drive/1bLGxKdldwqnMVA5x4neY7-l_8fKGWQYI?usp=sharing): This notebook contains
  an example on how to generate visual embeddings.
The following example shows how to get the last hidden state using [`VisualBertModel`]:
```python
>>> import torch
>>> from transformers import BertTokenizer, VisualBertModel

>>> model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

>>> inputs = tokenizer("What is the man eating?", return_tensors="pt")
>>> # this is a custom function that returns the visual embeddings given the image path
>>> visual_embeds = get_visual_embeddings(image_path)

>>> visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
>>> visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
>>> inputs.update({
...     "visual_embeds": visual_embeds,
...     "visual_token_type_ids": visual_token_type_ids,
...     "visual_attention_mask": visual_attention_mask
... })

>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
```
This model was contributed by [gchhablani](https://huggingface.co/gchhablani). The original code can be found [here](https://github.com/uclanlp/visualbert).

## VisualBertConfig

[[autodoc]] VisualBertConfig

## VisualBertModel

[[autodoc]] VisualBertModel
- forward

## VisualBertForPreTraining

[[autodoc]] VisualBertForPreTraining
- forward

## VisualBertForQuestionAnswering

[[autodoc]] VisualBertForQuestionAnswering
- forward

## VisualBertForMultipleChoice

[[autodoc]] VisualBertForMultipleChoice
- forward

## VisualBertForVisualReasoning

[[autodoc]] VisualBertForVisualReasoning
- forward

## VisualBertForRegionToPhraseAlignment

[[autodoc]] VisualBertForRegionToPhraseAlignment
- forward
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Vision Transformer (ViT)

<Tip>

This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs or slight
breaking changes to fix in the future. If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).

</Tip>
## Overview

The Vision Transformer (ViT) model was proposed in [An Image is Worth 16x16 Words: Transformers for Image Recognition
at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk
Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob
Uszkoreit, Neil Houlsby. It's the first paper that successfully trains a Transformer encoder on ImageNet, attaining
very good results compared to familiar convolutional architectures.
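For orientation, here is a minimal image-classification sketch with the `google/vit-base-patch16-224` checkpoint (the tips below explain the preprocessing and the checkpoint naming; the COCO image URL is only an example input):

```python
import requests
import torch
from PIL import Image
from transformers import ViTFeatureExtractor, ViTForImageClassification

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# resize/rescale and normalize the image, then classify it over the 1,000 ImageNet classes
inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```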
Tips: Tips:
- Demo notebooks regarding inference as well as fine-tuning ViT on custom data can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer).
- To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches,
  which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image, which can be
  used for classification. The authors also add absolute position embeddings, and feed the resulting sequence of
  vectors to a standard Transformer encoder.
- As the Vision Transformer expects each image to be of the same size (resolution), one can use
  [`ViTFeatureExtractor`] to resize (or rescale) and normalize images for the model (a usage sketch is shown at the
  end of this overview).
- Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
  each checkpoint. For example, `google/vit-base-patch16-224` refers to a base-sized architecture with patch
  resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the [hub](https://huggingface.co/models?search=vit).
- The available checkpoints are either (1) pre-trained on [ImageNet-21k](http://www.image-net.org/) (a collection of
  14 million images and 21k classes) only, or (2) also fine-tuned on [ImageNet](http://www.image-net.org/challenges/LSVRC/2012/) (also referred to as ILSVRC 2012, a collection of 1.3 million
  images and 1,000 classes).
- The Vision Transformer was pre-trained using a resolution of 224x224. During fine-tuning, it is often beneficial to
  use a higher resolution than pre-training [(Touvron et al., 2019)](https://arxiv.org/abs/1906.06423), [(Kolesnikov
  et al., 2020)](https://arxiv.org/abs/1912.11370). In order to fine-tune at higher resolution, the authors perform
  2D interpolation of the pre-trained position embeddings, according to their location in the original image.
- The best results are obtained with supervised pre-training, which is not the case in NLP. The authors also performed
  an experiment with a self-supervised pre-training objective, namely masked patched prediction (inspired by masked
@@ -71,81 +66,62 @@ Tips:
Following the original Vision Transformer, some follow-up works have been made:
- DeiT (Data-efficient Image Transformers) by Facebook AI. DeiT models are distilled vision transformers. Refer to
  [DeiT's documentation page](deit). The authors of DeiT also released more efficiently trained ViT models, which
  you can directly plug into [`ViTModel`] or [`ViTForImageClassification`]. There
  are 4 variants available (in 3 different sizes): *facebook/deit-tiny-patch16-224*, *facebook/deit-small-patch16-224*,
  *facebook/deit-base-patch16-224* and *facebook/deit-base-patch16-384*. Note that one should use
  [`DeiTFeatureExtractor`] in order to prepare images for the model, as sketched just after this list.
- BEiT (BERT pre-training of Image Transformers) by Microsoft Research. BEiT models outperform supervised pre-trained
  vision transformers using a self-supervised method inspired by BERT (masked image modeling) and based on a VQ-VAE.
  Refer to [BEiT's documentation page](beit).
- DINO (a method for self-supervised training of Vision Transformers) by Facebook AI. Vision Transformers trained using
  the DINO method show very interesting properties not seen with convolutional models. They are capable of segmenting
  objects, without having ever been trained to do so. DINO checkpoints can be found on the [hub](https://huggingface.co/models?other=dino).
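As a rough illustration of the DeiT point above, loading one of those checkpoints could look like the following minimal sketch. The checkpoint name is one of the four variants listed above; only the feature extractor class differs from the usual ViT workflow:

```python
>>> from transformers import DeiTFeatureExtractor, ViTForImageClassification

>>> # the more efficiently trained DeiT weights load directly into the regular ViT classes
>>> model = ViTForImageClassification.from_pretrained("facebook/deit-base-patch16-224")

>>> # images should however be prepared with DeiTFeatureExtractor rather than ViTFeatureExtractor
>>> feature_extractor = DeiTFeatureExtractor.from_pretrained("facebook/deit-base-patch16-224")
```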
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code (written in JAX) can be
found [here](https://github.com/google-research/vision_transformer).

Note that we converted the weights from Ross Wightman's [timm library](https://github.com/rwightman/pytorch-image-models), who already converted the weights from JAX to PyTorch. Credits
go to him!
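To make the preprocessing and classification workflow from the tips above concrete, here is a minimal inference sketch. The image URL is only an example input; any RGB image works:

```python
>>> from transformers import ViTFeatureExtractor, ViTForImageClassification
>>> from PIL import Image
>>> import requests

>>> # download an example image
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> # the feature extractor resizes the image to 224x224 and normalizes it for this checkpoint
>>> feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
>>> model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> inputs = feature_extractor(images=image, return_tensors="pt")
>>> logits = model(**inputs).logits

>>> # the model predicts one of the 1,000 ImageNet classes
>>> predicted_class_idx = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_class_idx])
```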
## ViTConfig

[[autodoc]] ViTConfig

## ViTFeatureExtractor

[[autodoc]] ViTFeatureExtractor
    - __call__

## ViTModel

[[autodoc]] ViTModel
    - forward

## ViTForImageClassification

[[autodoc]] ViTForImageClassification
    - forward

## TFViTModel

[[autodoc]] TFViTModel
    - call

## TFViTForImageClassification

[[autodoc]] TFViTForImageClassification
    - call

## FlaxViTModel

[[autodoc]] FlaxViTModel
    - __call__

## FlaxViTForImageClassification

[[autodoc]] FlaxViTForImageClassification
    - __call__