Unverified Commit 65b20b73 authored by NielsRogge, committed by GitHub

Add Perceiver IO (#14487)

* First draft

* Style and remove mlm

* Make forward pass work

* More improvements

* More improvements

* Fix bug

* More improvements

* More improvements

* Add PerceiverTokenizer first draft

* Improve conversion script

* More improvements

* Make conversion script work for the encoder

* Make conversion script work with local pickle files

* Style & quality, fix-copies

* Add dummy input to conversion script

* Add absolute position embeddings to TextPreProcessor

* Make forward pass of encoder work

* More improvements

* Move text preprocessor to separate script

* More improvements

* More improvements

* Add post processor

* Make MLM model work

* Style

* Add PerceiverForMaskedLM

* Add PerceiverImagePreprocessor

* Make style

* Make PerceiverForImageClassification work

* More improvements

* More improvements

* Use tokenizer in conversion script

* Use PerceiverForMaskedLM in conversion script

* Define custom PerceiverModelOutput

* Improve PerceiverAttention to make it work for both MLM and image classification

* More improvements

* More improvements

* More improvements to the conversion script

* Make conversion script work for both MLM and image classification

* Add PerceiverFeatureExtractor

* More improvements

* Style and quality

* Add center cropping

* Fix bug

* Small fix

* Add print statement

* Fix bug in image preprocessor

* Fix bug with conversion script

* Make output position embeddings an nn.Parameter layer instead of nn.Embedding

* Comment out print statements

* Add position encoding classes

* More improvements

* Use position_encoding_kwargs

* Add PerceiverForImageClassificationFourier

* Make style & quality

* Add PerceiverForImageClassificationConvProcessing

* Style & quality

* Add flow model

* Move processors to modeling file

* Make position encodings modular

* Make basic decoder use modular position encodings

* Add PerceiverForOpticalFlow to conversion script

* Add AudioPreprocessor

* Make it possible for the basic decoder to use Fourier position embeddings

* Add PerceiverForMultimodalAutoencoding

* Improve model for optical flow

* Improve _build_network_inputs method

* Add print statement

* Fix device issue

* Fix device of Fourier embeddings

* Add print statements for debugging

* Add another print statement

* Add another print statement

* Add another print statement

* Add another print statement

* Improve PerceiverAudioPreprocessor

* Improve conversion script for multimodal model

* More improvements

* More improvements

* Improve multimodal model

* Make forward pass multimodal model work

* More improvements

* Improve tests

* Fix some more tests

* Add output dataclasses

* Make more tests pass

* Add print statements for debugging

* Add tests for image classification

* Add PerceiverClassifierOutput

* More improvements

* Make more tests pass for the optical flow model

* Make style & quality

* Small improvements

* Don't support training for optical flow model for now

* Fix _prepare_for_class for tests

* Make more tests pass, add some docs

* Add multimodal model to tests

* Minor fixes

* Fix tests

* Improve conversion script

* Make fixup

* Remove pos_dim argument

* Fix device issue

* Potential fix for OOM

* Revert previous commit

* Fix test_initialization

* Add print statements for debugging

* Fix print statement

* Add print statement

* Add print statement

* Add print statement

* Add print statement

* Add print statement

* Add print statement

* Remove need for output_shape

* Comment out output_shape

* Remove unnecessary code

* Improve docs

* Fix make fixup

* Remove PerceiverTextProcessor from init

* Improve docs

* Small improvement

* Apply first batch of suggestions from code review

* Apply more suggestions from code review

* Update docstrings

* Define dicts beforehand for readability

* Rename task to architecture in conversion script, include PerceiverModel in tests

* Add print statements for debugging

* Fix tests on GPU

* Remove preprocessors, postprocessors and decoders from main init

* Add integration test

* Fix docs

* Replace einops by torch

* Update for new docs frontend

* Rename PerceiverForImageClassification

* Improve docs

* Improve docs

* Improve docs of PerceiverModel

* Fix some more tests

* Improve center_crop

* Add PerceiverForSequenceClassification

* Small improvements

* Fix tests

* Add integration test for optical flow model

* Clean up

* Add tests for tokenizer

* Fix tokenizer by adding special tokens properly

* Fix CI
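
To make the scope of the new API concrete, here is a minimal usage sketch of the byte-level masked-language-modeling path added in this PR (PerceiverTokenizer plus PerceiverForMaskedLM). The checkpoint name and the masked byte offsets are illustrative assumptions rather than part of this commit; check the model hub entry that accompanies the release for the published weights.

```python
# Sketch: byte-level masked LM with the Perceiver IO classes introduced by this PR.
# "deepmind/language-perceiver" is an assumed checkpoint name; adjust to the published one.
from transformers import PerceiverTokenizer, PerceiverForMaskedLM

tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
model = PerceiverForMaskedLM.from_pretrained("deepmind/language-perceiver")

text = "This is an incomplete sentence where some words are missing."
inputs = tokenizer(text, padding="max_length", return_tensors="pt")

# PerceiverTokenizer operates on raw UTF-8 bytes, so masking is done per byte position;
# offsets 52:61 cover the substring " missing." in this particular sentence.
inputs["input_ids"][0, 52:61] = tokenizer.mask_token_id

outputs = model(**inputs)
logits = outputs.logits  # (batch_size, seq_len, vocab_size) over a small byte vocabulary

# Greedily decode the masked span back to text.
predicted_ids = logits[0, 52:61].argmax(dim=-1).tolist()
print(tokenizer.decode(predicted_ids))
```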
parent 961732c2
@@ -286,6 +286,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
......
@@ -265,6 +265,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
......
@@ -289,6 +289,7 @@ conda install -c huggingface transformers
1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (来自 Microsoft Research) 伴随论文 [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) 由 Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu 发布。
1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (来自 Google AI) 伴随论文 [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) 由 Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel 发布。
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (来自 Google) 伴随论文 [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) 由 Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu 发布。
1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (来自 Deepmind) 伴随论文 [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) 由 Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira 发布。
1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (来自 VinAI Research) 伴随论文 [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) 由 Dat Quoc Nguyen and Anh Tuan Nguyen 发布。
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (来自 Microsoft Research) 伴随论文 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 由 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 发布。
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (来自 NVIDIA) 伴随论文 [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) 由 Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius 发布。
......
@@ -301,6 +301,7 @@ conda install -c huggingface transformers
1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
......
@@ -146,6 +146,7 @@ conversion utilities for the following models.
1. **[MPNet](model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
1. **[MT5](model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1. **[Pegasus](model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[Perceiver IO](model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1. **[PhoBERT](model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
1. **[ProphetNet](model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[QDQBert](model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
@@ -234,6 +235,7 @@ Flax), PyTorch, and/or TensorFlow.
| OpenAI GPT | ✅ | ✅ | ✅ | ✅ | ❌ |
| OpenAI GPT-2 | ✅ | ✅ | ✅ | ✅ | ✅ |
| Pegasus | ✅ | ✅ | ✅ | ✅ | ✅ |
| Perceiver | ✅ | ❌ | ✅ | ❌ | ❌ |
| ProphetNet | ✅ | ❌ | ✅ | ❌ | ❌ |
| QDQBert | ❌ | ❌ | ✅ | ❌ | ❌ |
| RAG | ✅ | ❌ | ✅ | ✅ | ❌ |
......
Transformers
=======================================================================================================================
State-of-the-art Natural Language Processing for Jax, Pytorch and TensorFlow
🤗 Transformers (formerly known as `pytorch-transformers` and `pytorch-pretrained-bert`) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with 32+ pretrained models in 100+ languages and deep interoperability between Jax,
PyTorch and TensorFlow.
This is the documentation of our repository `transformers <https://github.com/huggingface/transformers>`__. You can
also follow our `online course <https://huggingface.co/course>`__ that teaches how to use this library, as well as the
other libraries developed by Hugging Face and the Hub.
If you are looking for custom support from the Hugging Face team
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a target="_blank" href="https://huggingface.co/support">
<img alt="HuggingFace Expert Acceleration Program" src="https://huggingface.co/front/thumbnails/support.png" style="max-width: 600px; border: 1px solid #eee; border-radius: 4px; box-shadow: 0 1px 2px 0 rgba(0, 0, 0, 0.05);">
</a><br>
Features
-----------------------------------------------------------------------------------------------------------------------
- High performance on NLU and NLG tasks
- Low barrier to entry for educators and practitioners
State-of-the-art NLP for everyone:
- Deep learning researchers
- Hands-on practitioners
- AI/ML/NLP teachers and educators
Lower compute costs, smaller carbon footprint:
- Researchers can share trained models instead of always retraining
- Practitioners can reduce compute time and production costs
- 8 architectures with over 30 pretrained models, some in more than 100 languages
Choose the right framework for every part of a model's lifetime:
- Train state-of-the-art models in 3 lines of code
- Deep interoperability between Jax, Pytorch and TensorFlow models
- Move a single model between Jax/PyTorch/TensorFlow frameworks at will
- Seamlessly pick the right framework for training, evaluation, production
The support for Jax is still experimental (with a few models right now), expect to see it grow in the coming months!
`All the model checkpoints <https://huggingface.co/models>`__ are seamlessly integrated from the huggingface.co `model
hub <https://huggingface.co>`__ where they are uploaded directly by `users <https://huggingface.co/users>`__ and
`organizations <https://huggingface.co/organizations>`__.
Current number of checkpoints: |checkpoints|
.. |checkpoints| image:: https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/models&color=brightgreen
Contents
-----------------------------------------------------------------------------------------------------------------------
The documentation is organized in five parts:
- **GET STARTED** contains a quick tour, the installation instructions and some useful information about our philosophy
and a glossary.
- **USING 🤗 TRANSFORMERS** contains general tutorials on how to use the library.
- **ADVANCED GUIDES** contains more advanced guides that are more specific to a given script or part of the library.
- **RESEARCH** focuses on tutorials that have less to do with how to use the library and more with general research in
transformer models.
- The last three sections contain the documentation of each public class and function, grouped in:
- **MAIN CLASSES** for the main classes exposing the important APIs of the library.
- **MODELS** for the classes and functions related to each model implemented in the library.
- **INTERNAL HELPERS** for the classes and functions we use internally.
The library currently contains Jax, PyTorch and TensorFlow implementations, pretrained model weights, usage scripts and
conversion utilities for the following models.
Supported models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
..
This list is updated automatically from the README with `make fix-copies`. Do not update manually!
1. :doc:`ALBERT <model_doc/albert>` (from Google Research and the Toyota Technological Institute at Chicago) released
with the paper `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
<https://arxiv.org/abs/1909.11942>`__, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush
Sharma, Radu Soricut.
2. :doc:`BART <model_doc/bart>` (from Facebook) released with the paper `BART: Denoising Sequence-to-Sequence
Pre-training for Natural Language Generation, Translation, and Comprehension
<https://arxiv.org/pdf/1910.13461.pdf>`__ by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman
Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
3. :doc:`BARThez <model_doc/barthez>` (from École polytechnique) released with the paper `BARThez: a Skilled Pretrained
French Sequence-to-Sequence Model <https://arxiv.org/abs/2010.12321>`__ by Moussa Kamal Eddine, Antoine J.-P.
Tixier, Michalis Vazirgiannis.
4. :doc:`BARTpho <model_doc/bartpho>` (from VinAI Research) released with the paper `BARTpho: Pre-trained
Sequence-to-Sequence Models for Vietnamese <https://arxiv.org/abs/2109.09701>`__ by Nguyen Luong Tran, Duong Minh Le
and Dat Quoc Nguyen.
5. :doc:`BEiT <model_doc/beit>` (from Microsoft) released with the paper `BEiT: BERT Pre-Training of Image Transformers
<https://arxiv.org/abs/2106.08254>`__ by Hangbo Bao, Li Dong, Furu Wei.
6. :doc:`BERT <model_doc/bert>` (from Google) released with the paper `BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`__ by Jacob Devlin, Ming-Wei Chang,
Kenton Lee and Kristina Toutanova.
7. :doc:`BERTweet <model_doc/bertweet>` (from VinAI Research) released with the paper `BERTweet: A pre-trained language
model for English Tweets <https://aclanthology.org/2020.emnlp-demos.2/>`__ by Dat Quoc Nguyen, Thanh Vu and Anh Tuan
Nguyen.
8. :doc:`BERT For Sequence Generation <model_doc/bertgeneration>` (from Google) released with the paper `Leveraging
Pre-trained Checkpoints for Sequence Generation Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi
Narayan, Aliaksei Severyn.
9. :doc:`BigBird-RoBERTa <model_doc/bigbird>` (from Google Research) released with the paper `Big Bird: Transformers
for Longer Sequences <https://arxiv.org/abs/2007.14062>`__ by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua
Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
10. :doc:`BigBird-Pegasus <model_doc/bigbird_pegasus>` (from Google Research) released with the paper `Big Bird:
Transformers for Longer Sequences <https://arxiv.org/abs/2007.14062>`__ by Manzil Zaheer, Guru Guruganesh, Avinava
Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr
Ahmed.
11. :doc:`Blenderbot <model_doc/blenderbot>` (from Facebook) released with the paper `Recipes for building an
open-domain chatbot <https://arxiv.org/abs/2004.13637>`__ by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary
Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
12. :doc:`BlenderbotSmall <model_doc/blenderbot_small>` (from Facebook) released with the paper `Recipes for building
an open-domain chatbot <https://arxiv.org/abs/2004.13637>`__ by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju,
Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
13. :doc:`BORT <model_doc/bort>` (from Alexa) released with the paper `Optimal Subarchitecture Extraction For BERT
<https://arxiv.org/abs/2010.10499>`__ by Adrian de Wynter and Daniel J. Perry.
14. :doc:`ByT5 <model_doc/byt5>` (from Google Research) released with the paper `ByT5: Towards a token-free future with
pre-trained byte-to-byte models <https://arxiv.org/abs/2105.13626>`__ by Linting Xue, Aditya Barua, Noah Constant,
Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel.
15. :doc:`CamemBERT <model_doc/camembert>` (from Inria/Facebook/Sorbonne) released with the paper `CamemBERT: a Tasty
French Language Model <https://arxiv.org/abs/1911.03894>`__ by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz
Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
16. :doc:`CANINE <model_doc/canine>` (from Google Research) released with the paper `CANINE: Pre-training an Efficient
Tokenization-Free Encoder for Language Representation <https://arxiv.org/abs/2103.06874>`__ by Jonathan H. Clark,
Dan Garrette, Iulia Turc, John Wieting.
17. :doc:`CLIP <model_doc/clip>` (from OpenAI) released with the paper `Learning Transferable Visual Models From
Natural Language Supervision <https://arxiv.org/abs/2103.00020>`__ by Alec Radford, Jong Wook Kim, Chris Hallacy,
Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen
Krueger, Ilya Sutskever.
18. :doc:`ConvBERT <model_doc/convbert>` (from YituTech) released with the paper `ConvBERT: Improving BERT with
Span-based Dynamic Convolution <https://arxiv.org/abs/2008.02496>`__ by Zihang Jiang, Weihao Yu, Daquan Zhou,
Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
19. :doc:`CPM <model_doc/cpm>` (from Tsinghua University) released with the paper `CPM: A Large-scale Generative
Chinese Pre-trained Language Model <https://arxiv.org/abs/2012.00413>`__ by Zhengyan Zhang, Xu Han, Hao Zhou, Pei
Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng,
Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang,
Juanzi Li, Xiaoyan Zhu, Maosong Sun.
20. :doc:`CTRL <model_doc/ctrl>` (from Salesforce) released with the paper `CTRL: A Conditional Transformer Language
Model for Controllable Generation <https://arxiv.org/abs/1909.05858>`__ by Nitish Shirish Keskar*, Bryan McCann*,
Lav R. Varshney, Caiming Xiong and Richard Socher.
21. :doc:`DeBERTa <model_doc/deberta>` (from Microsoft) released with the paper `DeBERTa: Decoding-enhanced BERT with
Disentangled Attention <https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu
Chen.
22. :doc:`DeBERTa-v2 <model_doc/deberta_v2>` (from Microsoft) released with the paper `DeBERTa: Decoding-enhanced BERT
with Disentangled Attention <https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao,
Weizhu Chen.
23. :doc:`DeiT <model_doc/deit>` (from Facebook) released with the paper `Training data-efficient image transformers &
distillation through attention <https://arxiv.org/abs/2012.12877>`__ by Hugo Touvron, Matthieu Cord, Matthijs
Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
24. :doc:`DETR <model_doc/detr>` (from Facebook) released with the paper `End-to-End Object Detection with Transformers
<https://arxiv.org/abs/2005.12872>`__ by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier,
Alexander Kirillov, Sergey Zagoruyko.
25. :doc:`DialoGPT <model_doc/dialogpt>` (from Microsoft Research) released with the paper `DialoGPT: Large-Scale
Generative Pre-training for Conversational Response Generation <https://arxiv.org/abs/1911.00536>`__ by Yizhe
Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
26. :doc:`DistilBERT <model_doc/distilbert>` (from HuggingFace), released together with the paper `DistilBERT, a
distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__ by Victor
Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into `DistilGPT2
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, RoBERTa into `DistilRoBERTa
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, Multilingual BERT into
`DistilmBERT <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__ and a German
version of DistilBERT.
27. :doc:`DPR <model_doc/dpr>` (from Facebook) released with the paper `Dense Passage Retrieval for Open-Domain
Question Answering <https://arxiv.org/abs/2004.04906>`__ by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick
Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
28. :doc:`EncoderDecoder <model_doc/encoderdecoder>` (from Google Research) released with the paper `Leveraging
Pre-trained Checkpoints for Sequence Generation Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi
Narayan, Aliaksei Severyn.
29. :doc:`ELECTRA <model_doc/electra>` (from Google Research/Stanford University) released with the paper `ELECTRA:
Pre-training text encoders as discriminators rather than generators <https://arxiv.org/abs/2003.10555>`__ by Kevin
Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
30. :doc:`FlauBERT <model_doc/flaubert>` (from CNRS) released with the paper `FlauBERT: Unsupervised Language Model
Pre-training for French <https://arxiv.org/abs/1912.05372>`__ by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne,
Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
31. :doc:`FNet <model_doc/fnet>` (from Google Research) released with the paper `FNet: Mixing Tokens with Fourier
Transforms <https://arxiv.org/abs/2105.03824>`__ by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago
Ontanon.
32. :doc:`Funnel Transformer <model_doc/funnel>` (from CMU/Google Brain) released with the paper `Funnel-Transformer:
Filtering out Sequential Redundancy for Efficient Language Processing <https://arxiv.org/abs/2006.03236>`__ by
Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
33. :doc:`GPT <model_doc/gpt>` (from OpenAI) released with the paper `Improving Language Understanding by Generative
Pre-Training <https://blog.openai.com/language-unsupervised/>`__ by Alec Radford, Karthik Narasimhan, Tim Salimans
and Ilya Sutskever.
34. :doc:`GPT-2 <model_doc/gpt2>` (from OpenAI) released with the paper `Language Models are Unsupervised Multitask
Learners <https://blog.openai.com/better-language-models/>`__ by Alec Radford*, Jeffrey Wu*, Rewon Child, David
Luan, Dario Amodei** and Ilya Sutskever**.
35. :doc:`GPT-J <model_doc/gptj>` (from EleutherAI) released in the repository `kingoflolz/mesh-transformer-jax
<https://github.com/kingoflolz/mesh-transformer-jax/>`__ by Ben Wang and Aran Komatsuzaki.
36. :doc:`GPT Neo <model_doc/gpt_neo>` (from EleutherAI) released in the repository `EleutherAI/gpt-neo
<https://github.com/EleutherAI/gpt-neo>`__ by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.
37. :doc:`Hubert <model_doc/hubert>` (from Facebook) released with the paper `HuBERT: Self-Supervised Speech
Representation Learning by Masked Prediction of Hidden Units <https://arxiv.org/abs/2106.07447>`__ by Wei-Ning Hsu,
Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
38. :doc:`I-BERT <model_doc/ibert>` (from Berkeley) released with the paper `I-BERT: Integer-only BERT Quantization
<https://arxiv.org/abs/2101.01321>`__ by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
39. `ImageGPT <https://huggingface.co/transformers/master/model_doc/imagegpt.html>`__ (from OpenAI) released with the
paper `Generative Pretraining from Pixels <https://openai.com/blog/image-gpt/>`__ by Mark Chen, Alec Radford, Rewon
Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
40. :doc:`LayoutLM <model_doc/layoutlm>` (from Microsoft Research Asia) released with the paper `LayoutLM: Pre-training
of Text and Layout for Document Image Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li,
Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
41. :doc:`LayoutLMv2 <model_doc/layoutlmv2>` (from Microsoft Research Asia) released with the paper `LayoutLMv2:
Multi-modal Pre-training for Visually-Rich Document Understanding <https://arxiv.org/abs/2012.14740>`__ by Yang Xu,
Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min
Zhang, Lidong Zhou.
42. :doc:`LayoutXLM <model_doc/layoutlmv2>` (from Microsoft Research Asia) released with the paper `LayoutXLM:
Multimodal Pre-training for Multilingual Visually-rich Document Understanding <https://arxiv.org/abs/2104.08836>`__
by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei.
43. :doc:`LED <model_doc/led>` (from AllenAI) released with the paper `Longformer: The Long-Document Transformer
<https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
44. :doc:`Longformer <model_doc/longformer>` (from AllenAI) released with the paper `Longformer: The Long-Document
Transformer <https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
45. :doc:`LUKE <model_doc/luke>` (from Studio Ousia) released with the paper `LUKE: Deep Contextualized Entity
Representations with Entity-aware Self-attention <https://arxiv.org/abs/2010.01057>`__ by Ikuya Yamada, Akari Asai,
Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto.
46. :doc:`LXMERT <model_doc/lxmert>` (from UNC Chapel Hill) released with the paper `LXMERT: Learning Cross-Modality
Encoder Representations from Transformers for Open-Domain Question Answering <https://arxiv.org/abs/1908.07490>`__
by Hao Tan and Mohit Bansal.
47. :doc:`M2M100 <model_doc/m2m_100>` (from Facebook) released with the paper `Beyond English-Centric Multilingual
Machine Translation <https://arxiv.org/abs/2010.11125>`__ by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma,
Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal,
Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
48. :doc:`MarianMT <model_doc/marian>` Machine translation models trained using `OPUS <http://opus.nlpl.eu/>`__ data by
Jörg Tiedemann. The `Marian Framework <https://marian-nmt.github.io/>`__ is being developed by the Microsoft
Translator Team.
49. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
Neural Machine Translation <https://arxiv.org/abs/2001.08210>`__ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li,
Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
50. :doc:`MBart-50 <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Translation with Extensible
Multilingual Pretraining and Finetuning <https://arxiv.org/abs/2008.00401>`__ by Yuqing Tang, Chau Tran, Xian Li,
Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
51. :doc:`Megatron-BERT <model_doc/megatron_bert>` (from NVIDIA) released with the paper `Megatron-LM: Training
Multi-Billion Parameter Language Models Using Model Parallelism <https://arxiv.org/abs/1909.08053>`__ by Mohammad
Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
52. :doc:`Megatron-GPT2 <model_doc/megatron_gpt2>` (from NVIDIA) released with the paper `Megatron-LM: Training
Multi-Billion Parameter Language Models Using Model Parallelism <https://arxiv.org/abs/1909.08053>`__ by Mohammad
Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
53. :doc:`MPNet <model_doc/mpnet>` (from Microsoft Research) released with the paper `MPNet: Masked and Permuted
Pre-training for Language Understanding <https://arxiv.org/abs/2004.09297>`__ by Kaitao Song, Xu Tan, Tao Qin,
Jianfeng Lu, Tie-Yan Liu.
54. :doc:`MT5 <model_doc/mt5>` (from Google AI) released with the paper `mT5: A massively multilingual pre-trained
text-to-text transformer <https://arxiv.org/abs/2010.11934>`__ by Linting Xue, Noah Constant, Adam Roberts, Mihir
Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
55. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
Gap-sentences for Abstractive Summarization <https://arxiv.org/abs/1912.08777>`__ by Jingqing Zhang, Yao Zhao,
Mohammad Saleh and Peter J. Liu.
56. `Perceiver IO <https://huggingface.co/transformers/master/model_doc/perceiver.html>`__ (from Deepmind) released
with the paper `Perceiver IO: A General Architecture for Structured Inputs & Outputs
<https://arxiv.org/abs/2107.14795>`__ by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch,
Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M.
Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
57. :doc:`PhoBERT <model_doc/phobert>` (from VinAI Research) released with the paper `PhoBERT: Pre-trained language
models for Vietnamese <https://www.aclweb.org/anthology/2020.findings-emnlp.92/>`__ by Dat Quoc Nguyen and Anh Tuan
Nguyen.
58. :doc:`ProphetNet <model_doc/prophetnet>` (from Microsoft Research) released with the paper `ProphetNet: Predicting
Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi,
Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
59. :doc:`QDQBert <model_doc/qdqbert>` (from NVIDIA) released with the paper `Integer Quantization for Deep Learning
Inference: Principles and Empirical Evaluation <https://arxiv.org/abs/2004.09602>`__ by Hao Wu, Patrick Judd,
Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
60. :doc:`Reformer <model_doc/reformer>` (from Google Research) released with the paper `Reformer: The Efficient
Transformer <https://arxiv.org/abs/2001.04451>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
61. :doc:`RemBERT <model_doc/rembert>` (from Google Research) released with the paper `Rethinking embedding coupling in
pre-trained language models <https://arxiv.org/pdf/2010.12821.pdf>`__ by Hyung Won Chung, Thibault Févry, Henry
Tsai, M. Johnson, Sebastian Ruder.
62. :doc:`RoBERTa <model_doc/roberta>` (from Facebook), released together with the paper `RoBERTa: A Robustly Optimized BERT
Pretraining Approach <https://arxiv.org/abs/1907.11692>`__ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar
Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
63. :doc:`RoFormer <model_doc/roformer>` (from ZhuiyiTechnology), released together with the paper `RoFormer:
Enhanced Transformer with Rotary Position Embedding <https://arxiv.org/pdf/2104.09864v1.pdf>`__ by Jianlin Su and
Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
64. :doc:`SegFormer <model_doc/segformer>` (from NVIDIA) released with the paper `SegFormer: Simple and Efficient
Design for Semantic Segmentation with Transformers <https://arxiv.org/abs/2105.15203>`__ by Enze Xie, Wenhai Wang,
Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
65. :doc:`SEW <model_doc/sew>` (from ASAPP) released with the paper `Performance-Efficiency Trade-offs in Unsupervised
Pre-training for Speech Recognition <https://arxiv.org/abs/2109.06870>`__ by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu
Han, Kilian Q. Weinberger, Yoav Artzi.
66. :doc:`SEW-D <model_doc/sew_d>` (from ASAPP) released with the paper `Performance-Efficiency Trade-offs in
Unsupervised Pre-training for Speech Recognition <https://arxiv.org/abs/2109.06870>`__ by Felix Wu, Kwangyoun Kim,
Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
67. :doc:`SpeechToTextTransformer <model_doc/speech_to_text>` (from Facebook), released together with the paper
`fairseq S2T: Fast Speech-to-Text Modeling with fairseq <https://arxiv.org/abs/2010.05171>`__ by Changhan Wang, Yun
Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
68. :doc:`SpeechToTextTransformer2 <model_doc/speech_to_text_2>` (from Facebook), released together with the paper
`Large-Scale Self- and Semi-Supervised Learning for Speech Translation <https://arxiv.org/abs/2104.06678>`__ by
Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
69. :doc:`Splinter <model_doc/splinter>` (from Tel Aviv University), released together with the paper `Few-Shot
Question Answering by Pretraining Span Selection <https://arxiv.org/abs/2101.00438>`__ by Ori Ram, Yuval Kirstain,
Jonathan Berant, Amir Globerson, Omer Levy.
70. :doc:`SqueezeBert <model_doc/squeezebert>` (from Berkeley) released with the paper `SqueezeBERT: What can computer
vision teach NLP about efficient neural networks? <https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola,
Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
71. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel and Noam Shazeer and Adam
Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
72. :doc:`T5v1.1 <model_doc/t5v1.1>` (from Google AI) released in the repository
`google-research/text-to-text-transfer-transformer
<https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511>`__ by
Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi
Zhou and Wei Li and Peter J. Liu.
73. :doc:`TAPAS <model_doc/tapas>` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via
Pre-training <https://arxiv.org/abs/2004.02349>`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller,
Francesco Piccinno and Julian Martin Eisenschlos.
74. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__ by Zihang Dai*,
Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
75. :doc:`TrOCR <model_doc/trocr>` (from Microsoft), released together with the paper `TrOCR: Transformer-based Optical
Character Recognition with Pre-trained Models <https://arxiv.org/abs/2109.10282>`__ by Minghao Li, Tengchao Lv, Lei
Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
76. :doc:`UniSpeech <model_doc/unispeech>` (from Microsoft Research) released with the paper `UniSpeech: Unified Speech
Representation Learning with Labeled and Unlabeled Data <https://arxiv.org/abs/2101.07597>`__ by Chengyi Wang, Yu
Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
77. :doc:`UniSpeechSat <model_doc/unispeech_sat>` (from Microsoft Research) released with the paper `UNISPEECH-SAT:
UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING <https://arxiv.org/abs/2110.05752>`__ by
Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li,
Xiangzhan Yu.
78. :doc:`Vision Transformer (ViT) <model_doc/vit>` (from Google AI) released with the paper `An Image is Worth 16x16
Words: Transformers for Image Recognition at Scale <https://arxiv.org/abs/2010.11929>`__ by Alexey Dosovitskiy,
Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias
Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
79. :doc:`VisualBERT <model_doc/visual_bert>` (from UCLA NLP) released with the paper `VisualBERT: A Simple and
Performant Baseline for Vision and Language <https://arxiv.org/pdf/1908.03557>`__ by Liunian Harold Li, Mark
Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
80. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
Self-Supervised Learning of Speech Representations <https://arxiv.org/abs/2006.11477>`__ by Alexei Baevski, Henry
Zhou, Abdelrahman Mohamed, Michael Auli.
81. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
82. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan,
Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
83. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau*, Kartikay
Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
Zettlemoyer and Veselin Stoyanov.
84. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang*, Zihang Dai*, Yiming
Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
85. :doc:`XLSR-Wav2Vec2 <model_doc/xlsr_wav2vec2>` (from Facebook AI) released with the paper `Unsupervised
Cross-Lingual Representation Learning For Speech Recognition <https://arxiv.org/abs/2006.13979>`__ by Alexis
Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
Supported frameworks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The table below represents the current support in the library for each of those models: whether they have a Python
tokenizer (called "slow"), a "fast" tokenizer backed by the 🤗 Tokenizers library, and whether they have support in Jax
(via Flax), PyTorch, and/or TensorFlow.
..
This table is updated automatically from the auto modules with `make fix-copies`. Do not update manually!
.. rst-class:: center-aligned-table
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Model | Tokenizer slow | Tokenizer fast | PyTorch support | TensorFlow support | Flax Support |
+=============================+================+================+=================+====================+==============+
| ALBERT | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| BART | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| BEiT | ❌ | ❌ | ✅ | ❌ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| BERT | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Bert Generation | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| BigBird | ✅ | ✅ | ✅ | ❌ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| BigBirdPegasus | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Blenderbot | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| BlenderbotSmall | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| CamemBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Canine | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| CLIP | ✅ | ✅ | ✅ | ❌ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| ConvBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| CTRL | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DeBERTa | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DeBERTa-v2 | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DeiT | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DETR | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DistilBERT | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DPR | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| ELECTRA | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Encoder decoder | ❌ | ❌ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| FairSeq Machine-Translation | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| FlauBERT | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| FNet | ✅ | ✅ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Funnel Transformer | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| GPT Neo | ❌ | ❌ | ✅ | ❌ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| GPT-J | ❌ | ❌ | ✅ | ❌ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Hubert | ❌ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| I-BERT | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| ImageGPT | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| LayoutLM | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| LayoutLMv2 | ✅ | ✅ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| LED | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Longformer | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| LUKE | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| LXMERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| M2M100 | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Marian | ✅ | ❌ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| mBART | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| MegatronBert | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| MobileBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| MPNet | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| mT5 | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| OpenAI GPT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| OpenAI GPT-2 | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Pegasus | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Perceiver | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| ProphetNet | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| QDQBert | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| RAG | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Reformer | ✅ | ✅ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| RemBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| RetriBERT | ✅ | ✅ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| RoBERTa | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| RoFormer | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| SegFormer | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| SEW | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| SEW-D | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Speech Encoder decoder | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Speech2Text | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Speech2Text2 | ✅ | ❌ | ❌ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Splinter | ✅ | ✅ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| SqueezeBERT | ✅ | ✅ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| T5 | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| TAPAS | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Transformer-XL | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| TrOCR | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| UniSpeech | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| UniSpeechSat | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Vision Encoder decoder | ❌ | ❌ | ✅ | ❌ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| VisionTextDualEncoder | ❌ | ❌ | ✅ | ❌ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| VisualBert | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| ViT | ❌ | ❌ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Wav2Vec2 | ✅ | ❌ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| XLM | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| XLM-RoBERTa | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| XLMProphetNet | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| XLNet | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
.. toctree::
:maxdepth: 2
:caption: Get started
quicktour
installation
philosophy
glossary
.. toctree::
:maxdepth: 2
:caption: Using 🤗 Transformers
task_summary
model_summary
preprocessing
training
model_sharing
tokenizer_summary
multilingual
.. toctree::
:maxdepth: 2
:caption: Advanced guides
pretrained_models
examples
troubleshooting
custom_datasets
notebooks
sagemaker
community
converting_tensorflow_models
migration
contributing
add_new_model
add_new_pipeline
fast_tokenizers
performance
parallelism
testing
debugging
serialization
pr_checks
.. toctree::
:maxdepth: 2
:caption: Research
bertology
perplexity
benchmarks
.. toctree::
:maxdepth: 2
:caption: Main Classes
main_classes/callback
main_classes/configuration
main_classes/data_collator
main_classes/keras_callbacks
main_classes/logging
main_classes/model
main_classes/optimizer_schedules
main_classes/output
main_classes/pipelines
main_classes/processors
main_classes/tokenizer
main_classes/trainer
main_classes/deepspeed
main_classes/feature_extractor
.. toctree::
:maxdepth: 2
:caption: Models
model_doc/albert
model_doc/auto
model_doc/bart
model_doc/barthez
model_doc/bartpho
model_doc/beit
model_doc/bert
model_doc/bertweet
model_doc/bertgeneration
model_doc/bert_japanese
model_doc/bigbird
model_doc/bigbird_pegasus
model_doc/blenderbot
model_doc/blenderbot_small
model_doc/bort
model_doc/byt5
model_doc/camembert
model_doc/canine
model_doc/clip
model_doc/convbert
model_doc/cpm
model_doc/ctrl
model_doc/deberta
model_doc/deberta_v2
model_doc/deit
model_doc/detr
model_doc/dialogpt
model_doc/distilbert
model_doc/dpr
model_doc/electra
model_doc/encoderdecoder
model_doc/flaubert
model_doc/fnet
model_doc/fsmt
model_doc/funnel
model_doc/herbert
model_doc/ibert
model_doc/imagegpt
model_doc/layoutlm
model_doc/layoutlmv2
model_doc/layoutxlm
model_doc/led
model_doc/longformer
model_doc/luke
model_doc/lxmert
model_doc/marian
model_doc/m2m_100
model_doc/mbart
model_doc/megatron_bert
model_doc/megatron_gpt2
model_doc/mobilebert
model_doc/mpnet
model_doc/mt5
model_doc/gpt
model_doc/gpt2
model_doc/gptj
model_doc/gpt_neo
model_doc/hubert
model_doc/pegasus
model_doc/perceiver
model_doc/phobert
model_doc/prophetnet
model_doc/qdqbert
model_doc/rag
model_doc/reformer
model_doc/rembert
model_doc/retribert
model_doc/roberta
model_doc/roformer
model_doc/segformer
model_doc/sew
model_doc/sew_d
model_doc/speechencoderdecoder
model_doc/speech_to_text
model_doc/speech_to_text_2
model_doc/splinter
model_doc/squeezebert
model_doc/t5
model_doc/t5v1.1
model_doc/tapas
model_doc/transformerxl
model_doc/trocr
model_doc/unispeech
model_doc/unispeech_sat
model_doc/visionencoderdecoder
model_doc/vision_text_dual_encoder
model_doc/vit
model_doc/visual_bert
model_doc/wav2vec2
model_doc/xlm
model_doc/xlmprophetnet
model_doc/xlmroberta
model_doc/xlnet
model_doc/xlsr_wav2vec2
.. toctree::
:maxdepth: 2
:caption: Internal Helpers
internal/modeling_utils
internal/pipelines_utils
internal/tokenization_utils
internal/trainer_utils
internal/generation_utils
internal/file_utils
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Perceiver
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Perceiver IO model was proposed in `Perceiver IO: A General Architecture for Structured Inputs & Outputs
<https://arxiv.org/abs/2107.14795>`__ by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch,
Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M.
Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
Perceiver IO is a generalization of `Perceiver <https://arxiv.org/abs/2103.03206>`__ to handle arbitrary outputs in
addition to arbitrary inputs. The original Perceiver only produced a single classification label. In addition to
classification labels, Perceiver IO can produce (for example) language, optical flow, and multimodal videos with audio.
This is done using the same building blocks as the original Perceiver. The computational complexity of Perceiver IO is
linear in the input and output size and the bulk of the processing occurs in the latent space, allowing us to process
inputs and outputs that are much larger than can be handled by standard Transformers. This means, for example,
Perceiver IO can do BERT-style masked language modeling directly using bytes instead of tokenized inputs.
The abstract from the paper is the following:
*The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point
clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of
inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without
sacrificing the original's appealing properties by learning to flexibly query the model's latent space to produce
outputs of arbitrary size and semantics. Perceiver IO still decouples model depth from data size and still scales
linearly with data size, but now with respect to both input and output sizes. The full Perceiver IO model achieves
strong results on tasks with highly structured output spaces, such as natural language and visual understanding,
StarCraft II, and multi-task and multi-modal domains. As highlights, Perceiver IO matches a Transformer-based BERT
baseline on the GLUE language benchmark without the need for input tokenization and achieves state-of-the-art
performance on Sintel optical flow estimation.*
Here's a TLDR explaining how Perceiver works:
The main problem with the self-attention mechanism of the Transformer is that the time and memory requirements scale
quadratically with the sequence length. Hence, models like BERT and RoBERTa are limited to a max sequence length of 512
tokens. Perceiver aims to solve this issue by performing self-attention not on the inputs themselves, but on a set of
latent variables, and only using the inputs for cross-attention. In this way, the time and memory requirements no
longer depend on the length of the inputs, as one uses a fixed number of latent variables, such as 256 or 512. These
are randomly initialized, after which they are trained end-to-end using backpropagation.
Internally, :class:`~transformers.PerceiverModel` will create the latents, a tensor of shape
:obj:`(batch_size, num_latents, d_latents)`. One must provide :obj:`inputs` (which could be text, images, audio, you
name it!) to the model, which it will use to perform cross-attention with the latents. The output of the Perceiver
encoder is a tensor of the same shape. One can then, similar to BERT, convert the last hidden states of the latents to
classification logits by averaging along the sequence dimension and placing a linear layer on top to project
:obj:`d_latents` to :obj:`num_labels`.
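A minimal sketch of this pooling idea (using randomly initialized weights from the default configuration; the two-label linear head is purely illustrative and not part of the library)::

    import torch
    from transformers import PerceiverConfig, PerceiverTokenizer, PerceiverModel
    from transformers.models.perceiver.modeling_perceiver import PerceiverTextPreprocessor

    config = PerceiverConfig()
    # the text preprocessor embeds the byte IDs and adds absolute position embeddings
    model = PerceiverModel(config, input_preprocessor=PerceiverTextPreprocessor(config))

    tokenizer = PerceiverTokenizer()
    inputs = tokenizer("hello world", return_tensors="pt").input_ids

    outputs = model(inputs=inputs)
    # the encoder output keeps the shape of the latents: (batch_size, num_latents, d_latents)
    print(outputs.last_hidden_state.shape)  # torch.Size([1, 256, 1280])

    # BERT-style classification: average the latents and project d_latents to num_labels
    pooled = outputs.last_hidden_state.mean(dim=1)
    logits = torch.nn.Linear(config.d_latents, 2)(pooled)  # (batch_size, 2)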
This was the idea of the original Perceiver paper. However, it could only output classification logits. In a follow-up
work, Perceiver IO, the authors generalized it to let the model also produce outputs of arbitrary size. How, you might
ask? The idea is actually relatively simple: one defines outputs of an arbitrary size, and then applies cross-attention
with the last hidden states of the latents, using the outputs as queries, and the latents as keys and values.
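Leaving the library's actual decoder classes aside, this decoding step boils down to a single cross-attention in which
the output array provides the queries and the latents provide the keys and values. A bare-bones PyTorch sketch, with
shapes picked purely for illustration::

    import torch
    import torch.nn as nn

    batch_size, num_latents, d_latents = 1, 256, 1280
    num_outputs, d_out = 2048, 768  # arbitrary; chosen by whoever defines the output queries

    latents = torch.randn(batch_size, num_latents, d_latents)     # final hidden states of the latents
    output_queries = torch.randn(batch_size, num_outputs, d_out)  # one query per desired output

    q_proj = nn.Linear(d_out, d_out)
    k_proj = nn.Linear(d_latents, d_out)
    v_proj = nn.Linear(d_latents, d_out)
    q, k, v = q_proj(output_queries), k_proj(latents), v_proj(latents)

    attention = torch.softmax(q @ k.transpose(-1, -2) / d_out**0.5, dim=-1)  # (batch_size, num_outputs, num_latents)
    decoded = attention @ v  # (batch_size, num_outputs, d_out): one vector per output position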
So let's say one wants to perform masked language modeling (BERT-style) with the Perceiver. As the Perceiver's input
length will not have an impact on the computation time of the self-attention layers, one can provide raw bytes,
providing :obj:`inputs` of length 2048 to the model. If one now masks out some of these 2048 tokens, one can define the
:obj:`outputs` as being of shape :obj:`(batch_size, 2048, 768)`. Next, one performs cross-attention with the final
hidden states of the latents to update the :obj:`outputs` tensor. After cross-attention, one still has a tensor of
shape :obj:`(batch_size, 2048, 768)`. One can then place a regular language modeling head on top to project the last
dimension to the vocabulary size of the model, i.e. creating logits of shape :obj:`(batch_size, 2048, 262)` (as
Perceiver uses a vocabulary size of 262 byte IDs).
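For example, assuming the `deepmind/language-perceiver <https://huggingface.co/deepmind/language-perceiver>`__
checkpoint, byte-level masked language modeling could look roughly as follows (the slice ``52:61`` corresponds to the
bytes of " missing." after the special token at the start)::

    from transformers import PerceiverTokenizer, PerceiverForMaskedLM

    tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
    model = PerceiverForMaskedLM.from_pretrained("deepmind/language-perceiver")

    text = "This is an incomplete sentence where some words are missing."
    inputs = tokenizer(text, padding="max_length", return_tensors="pt")
    # mask the bytes corresponding to " missing." (the tokenizer works on raw UTF-8 bytes)
    inputs["input_ids"][0, 52:61] = tokenizer.mask_token_id

    outputs = model(**inputs)
    logits = outputs.logits  # shape (batch_size, 2048, 262)
    predictions = logits[0, 52:61].argmax(dim=-1)
    print(tokenizer.decode(predictions))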
This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
<https://github.com/deepmind/deepmind-research/tree/master/perceiver>`__.
Perceiver specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverModelOutput
:members:
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverDecoderOutput
:members:
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverMaskedLMOutput
:members:
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput
:members:
PerceiverConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverConfig
:members:
PerceiverTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
PerceiverFeatureExtractor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverFeatureExtractor
:members:
PerceiverTextPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverTextPreprocessor
:members:
PerceiverImagePreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor
:members:
PerceiverOneHotPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverOneHotPreprocessor
:members:
PerceiverAudioPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverAudioPreprocessor
:members:
PerceiverMultimodalPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor
:members:
PerceiverProjectionPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverProjectionPostprocessor
:members:
PerceiverAudioPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverAudioPostprocessor
:members:
PerceiverClassificationPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverClassificationPostprocessor
:members:
PerceiverMultimodalPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPostprocessor
:members:
PerceiverModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverModel
:members: forward
PerceiverForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForMaskedLM
:members: forward
PerceiverForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForSequenceClassification
:members: forward
PerceiverForImageClassificationLearned
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForImageClassificationLearned
:members: forward
PerceiverForImageClassificationFourier
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForImageClassificationFourier
:members: forward
PerceiverForImageClassificationConvProcessing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForImageClassificationConvProcessing
:members: forward
PerceiverForOpticalFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForOpticalFlow
:members: forward
PerceiverForMultimodalAutoencoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForMultimodalAutoencoding
:members: forward
@@ -253,6 +253,7 @@ _import_structure = {
"models.mt5": ["MT5Config"],
"models.openai": ["OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP", "OpenAIGPTConfig", "OpenAIGPTTokenizer"],
"models.pegasus": ["PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP", "PegasusConfig", "PegasusTokenizer"],
"models.perceiver": ["PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP", "PerceiverConfig", "PerceiverTokenizer"],
"models.phobert": ["PhobertTokenizer"], "models.phobert": ["PhobertTokenizer"],
"models.prophetnet": ["PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "ProphetNetConfig", "ProphetNetTokenizer"], "models.prophetnet": ["PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "ProphetNetConfig", "ProphetNetTokenizer"],
"models.qdqbert": ["QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "QDQBertConfig"], "models.qdqbert": ["QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "QDQBertConfig"],
...@@ -502,6 +503,7 @@ if is_vision_available(): ...@@ -502,6 +503,7 @@ if is_vision_available():
_import_structure["models.layoutlmv2"].append("LayoutLMv2FeatureExtractor") _import_structure["models.layoutlmv2"].append("LayoutLMv2FeatureExtractor")
_import_structure["models.layoutlmv2"].append("LayoutLMv2Processor") _import_structure["models.layoutlmv2"].append("LayoutLMv2Processor")
_import_structure["models.layoutxlm"].append("LayoutXLMProcessor") _import_structure["models.layoutxlm"].append("LayoutXLMProcessor")
_import_structure["models.perceiver"].append("PerceiverFeatureExtractor")
_import_structure["models.segformer"].append("SegformerFeatureExtractor") _import_structure["models.segformer"].append("SegformerFeatureExtractor")
_import_structure["models.vit"].append("ViTFeatureExtractor") _import_structure["models.vit"].append("ViTFeatureExtractor")
else: else:
...@@ -1144,6 +1146,21 @@ if is_torch_available(): ...@@ -1144,6 +1146,21 @@ if is_torch_available():
_import_structure["models.pegasus"].extend( _import_structure["models.pegasus"].extend(
["PegasusForCausalLM", "PegasusForConditionalGeneration", "PegasusModel", "PegasusPreTrainedModel"] ["PegasusForCausalLM", "PegasusForConditionalGeneration", "PegasusModel", "PegasusPreTrainedModel"]
) )
_import_structure["models.perceiver"].extend(
[
"PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST",
"PerceiverForImageClassificationConvProcessing",
"PerceiverForImageClassificationFourier",
"PerceiverForImageClassificationLearned",
"PerceiverForMaskedLM",
"PerceiverForMultimodalAutoencoding",
"PerceiverForOpticalFlow",
"PerceiverForSequenceClassification",
"PerceiverLayer",
"PerceiverModel",
"PerceiverPreTrainedModel",
]
)
_import_structure["models.prophetnet"].extend( _import_structure["models.prophetnet"].extend(
[ [
"PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST", "PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST",
...@@ -2263,6 +2280,7 @@ if TYPE_CHECKING: ...@@ -2263,6 +2280,7 @@ if TYPE_CHECKING:
from .models.mt5 import MT5Config from .models.mt5 import MT5Config
from .models.openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig, OpenAIGPTTokenizer from .models.openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig, OpenAIGPTTokenizer
from .models.pegasus import PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP, PegasusConfig, PegasusTokenizer from .models.pegasus import PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP, PegasusConfig, PegasusTokenizer
from .models.perceiver import PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP, PerceiverConfig, PerceiverTokenizer
from .models.phobert import PhobertTokenizer
from .models.prophetnet import PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, ProphetNetConfig, ProphetNetTokenizer
from .models.qdqbert import QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, QDQBertConfig
@@ -2470,6 +2488,7 @@ if TYPE_CHECKING:
from .models.imagegpt import ImageGPTFeatureExtractor
from .models.layoutlmv2 import LayoutLMv2FeatureExtractor, LayoutLMv2Processor
from .models.layoutxlm import LayoutXLMProcessor
from .models.perceiver import PerceiverFeatureExtractor
from .models.segformer import SegformerFeatureExtractor
from .models.vit import ViTFeatureExtractor
else:
@@ -3006,6 +3025,19 @@ if TYPE_CHECKING:
PegasusModel,
PegasusPreTrainedModel,
)
from .models.perceiver import (
PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST,
PerceiverForImageClassificationConvProcessing,
PerceiverForImageClassificationFourier,
PerceiverForImageClassificationLearned,
PerceiverForMaskedLM,
PerceiverForMultimodalAutoencoding,
PerceiverForOpticalFlow,
PerceiverForSequenceClassification,
PerceiverLayer,
PerceiverModel,
PerceiverPreTrainedModel,
)
from .models.prophetnet import (
PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST,
ProphetNetDecoder,
...
@@ -78,6 +78,7 @@ from . import (
mt5,
openai,
pegasus,
perceiver,
phobert,
prophetnet,
qdqbert,
...
@@ -37,6 +37,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("fnet", "FNetConfig"),
("segformer", "SegformerConfig"),
("vision-text-dual-encoder", "VisionTextDualEncoderConfig"),
("perceiver", "PerceiverConfig"),
("gptj", "GPTJConfig"), ("gptj", "GPTJConfig"),
("layoutlmv2", "LayoutLMv2Config"), ("layoutlmv2", "LayoutLMv2Config"),
("beit", "BeitConfig"), ("beit", "BeitConfig"),
...@@ -119,6 +120,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict( ...@@ -119,6 +120,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("fnet", "FNET_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("fnet", "FNET_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("pegasus", "PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("pegasus", "PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("segformer", "SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("segformer", "SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("perceiver", "PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("gptj", "GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("gptj", "GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("layoutlmv2", "LAYOUTLMV2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("layoutlmv2", "LAYOUTLMV2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("beit", "BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("beit", "BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
...@@ -194,6 +196,7 @@ MODEL_NAMES_MAPPING = OrderedDict( ...@@ -194,6 +196,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("fnet", "FNet"), ("fnet", "FNet"),
("segformer", "SegFormer"), ("segformer", "SegFormer"),
("vision-text-dual-encoder", "VisionTextDualEncoder"), ("vision-text-dual-encoder", "VisionTextDualEncoder"),
("perceiver", "Perceiver"),
("gptj", "GPT-J"), ("gptj", "GPT-J"),
("beit", "BEiT"), ("beit", "BEiT"),
("rembert", "RemBERT"), ("rembert", "RemBERT"),
......
...@@ -33,6 +33,7 @@ MODEL_MAPPING_NAMES = OrderedDict( ...@@ -33,6 +33,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("fnet", "FNetModel"), ("fnet", "FNetModel"),
("segformer", "SegformerModel"), ("segformer", "SegformerModel"),
("vision-text-dual-encoder", "VisionTextDualEncoderModel"), ("vision-text-dual-encoder", "VisionTextDualEncoderModel"),
("perceiver", "PerceiverModel"),
("gptj", "GPTJModel"), ("gptj", "GPTJModel"),
("layoutlmv2", "LayoutLMv2Model"), ("layoutlmv2", "LayoutLMv2Model"),
("beit", "BeitModel"), ("beit", "BeitModel"),
...@@ -247,6 +248,14 @@ MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES = OrderedDict( ...@@ -247,6 +248,14 @@ MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
("beit", "BeitForImageClassification"), ("beit", "BeitForImageClassification"),
("segformer", "SegformerForImageClassification"), ("segformer", "SegformerForImageClassification"),
("imagegpt", "ImageGPTForImageClassification"), ("imagegpt", "ImageGPTForImageClassification"),
(
"perceiver",
(
"PerceiverForImageClassificationLearned",
"PerceiverForImageClassificationFourier",
"PerceiverForImageClassificationConvProcessing",
),
),
]
)
@@ -266,6 +275,7 @@ MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES = OrderedDict(
MODEL_FOR_MASKED_LM_MAPPING_NAMES = OrderedDict(
[
# Model for Masked LM mapping
("perceiver", "PerceiverForMaskedLM"),
("qdqbert", "QDQBertForMaskedLM"), ("qdqbert", "QDQBertForMaskedLM"),
("fnet", "FNetForMaskedLM"), ("fnet", "FNetForMaskedLM"),
("rembert", "RemBertForMaskedLM"), ("rembert", "RemBertForMaskedLM"),
...@@ -337,6 +347,7 @@ MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = OrderedDict( ...@@ -337,6 +347,7 @@ MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = OrderedDict(
MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = OrderedDict( MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
[ [
# Model for Sequence Classification mapping # Model for Sequence Classification mapping
("perceiver", "PerceiverForSequenceClassification"),
("qdqbert", "QDQBertForSequenceClassification"), ("qdqbert", "QDQBertForSequenceClassification"),
("fnet", "FNetForSequenceClassification"), ("fnet", "FNetForSequenceClassification"),
("gptj", "GPTJForSequenceClassification"), ("gptj", "GPTJForSequenceClassification"),
......
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...file_utils import _LazyModule, is_tokenizers_available, is_torch_available, is_vision_available
_import_structure = {
"configuration_perceiver": ["PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP", "PerceiverConfig"],
"tokenization_perceiver": ["PerceiverTokenizer"],
}
if is_vision_available():
_import_structure["feature_extraction_perceiver"] = ["PerceiverFeatureExtractor"]
if is_torch_available():
_import_structure["modeling_perceiver"] = [
"PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST",
"PerceiverForImageClassificationConvProcessing",
"PerceiverForImageClassificationFourier",
"PerceiverForImageClassificationLearned",
"PerceiverForMaskedLM",
"PerceiverForMultimodalAutoencoding",
"PerceiverForOpticalFlow",
"PerceiverForSequenceClassification",
"PerceiverLayer",
"PerceiverModel",
"PerceiverPreTrainedModel",
]
if TYPE_CHECKING:
from .configuration_perceiver import PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP, PerceiverConfig
from .tokenization_perceiver import PerceiverTokenizer
if is_vision_available():
from .feature_extraction_perceiver import PerceiverFeatureExtractor
if is_torch_available():
from .modeling_perceiver import (
PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST,
PerceiverForImageClassificationConvProcessing,
PerceiverForImageClassificationFourier,
PerceiverForImageClassificationLearned,
PerceiverForMaskedLM,
PerceiverForMultimodalAutoencoding,
PerceiverForOpticalFlow,
PerceiverForSequenceClassification,
PerceiverLayer,
PerceiverModel,
PerceiverPreTrainedModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
# coding=utf-8
# Copyright Deepmind and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Perceiver model configuration """
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"deepmind/language-perceiver": "https://huggingface.co/deepmind/language-perceiver/resolve/main/config.json",
# See all Perceiver models at https://huggingface.co/models?filter=perceiver
}
class PerceiverConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.PerceiverModel`. It is used
to instantiate a Perceiver model according to the specified arguments, defining the model architecture.
Instantiating a configuration with the defaults will yield a similar configuration to that of the Perceiver
`deepmind/language-perceiver <https://huggingface.co/deepmind/language-perceiver>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Args:
num_latents (:obj:`int`, `optional`, defaults to 256):
The number of latents.
d_latents (:obj:`int`, `optional`, defaults to 1280):
Dimension of the latent embeddings.
d_model (:obj:`int`, `optional`, defaults to 768):
Dimension of the inputs.
num_blocks (:obj:`int`, `optional`, defaults to 1):
Number of blocks in the Transformer encoder.
num_self_attends_per_block (:obj:`int`, `optional`, defaults to 26):
The number of self-attention layers per block.
num_self_attention_heads (:obj:`int`, `optional`, defaults to 8):
Number of attention heads for each self-attention layer in the Transformer encoder.
num_cross_attention_heads (:obj:`int`, `optional`, defaults to 8):
Number of attention heads for each cross-attention layer in the Transformer encoder.
qk_channels (:obj:`int`, `optional`):
Dimension to project the queries + keys before applying attention in the cross-attention and self-attention
layers of the encoder. Will default to preserving the dimension of the queries if not specified.
v_channels (:obj:`int`, `optional`):
Dimension to project the values before applying attention in the cross-attention and self-attention layers
of the encoder. Will default to preserving the dimension of the queries if not specified.
cross_attention_shape_for_attention (:obj:`str`, `optional`, defaults to :obj:`'kv'`):
Dimension to use when downsampling the queries and keys in the cross-attention layer of the encoder.
self_attention_widening_factor (:obj:`int`, `optional`, defaults to 1):
Widening factor of the feed-forward layer in the self-attention layers of the Transformer encoder.
cross_attention_widening_factor (:obj:`int`, `optional`, defaults to 1):
Widening factor of the feed-forward layer in the cross-attention layer of the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"selu"` and :obj:`"gelu_new"` are supported.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
use_query_residual (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to add a query residual in the cross-attention layer of the encoder.
vocab_size (:obj:`int`, `optional`, defaults to 262):
Vocabulary size for the masked language modeling model.
max_position_embeddings (:obj:`int`, `optional`, defaults to 2048):
The maximum sequence length that the masked language modeling model might ever be used with. Typically set
this to something large just in case (e.g., 512 or 1024 or 2048).
image_size (:obj:`int`, `optional`, defaults to 56):
Size of the images after preprocessing, for :class:`~transformers.PerceiverForImageClassificationLearned`.
train_size (:obj:`List[int]`, `optional`, defaults to [368, 496]):
Training size of the images for the optical flow model.
num_frames (:obj:`int`, `optional`, defaults to 16):
Number of video frames used for the multimodal autoencoding model.
audio_samples_per_frame (:obj:`int`, `optional`, defaults to 1920):
Number of audio samples per frame for the multimodal autoencoding model.
samples_per_patch (:obj:`int`, `optional`, defaults to 16):
Number of audio samples per patch when preprocessing the audio for the multimodal autoencoding model.
output_shape (:obj:`List[int]`, `optional`, defaults to :obj:`[1, 16, 224, 224]`):
Shape of the output for the multimodal autoencoding model.
Example::
>>> from transformers import PerceiverModel, PerceiverConfig
>>> # Initializing a Perceiver deepmind/language-perceiver style configuration
>>> configuration = PerceiverConfig()
>>> # Initializing a model from the deepmind/language-perceiver style configuration
>>> model = PerceiverModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
model_type = "perceiver"
def __init__(
self,
num_latents=256,
d_latents=1280,
d_model=768,
num_blocks=1,
num_self_attends_per_block=26,
num_self_attention_heads=8,
num_cross_attention_heads=8,
qk_channels=None,
v_channels=None,
cross_attention_shape_for_attention="kv",
self_attention_widening_factor=1,
cross_attention_widening_factor=1,
hidden_act="gelu",
attention_probs_dropout_prob=0.1,
position_embedding_init_scale=0.02,
initializer_range=0.02,
layer_norm_eps=1e-12,
is_encoder_decoder=False,
use_query_residual=True,
vocab_size=262,
max_position_embeddings=2048,
image_size=56,
train_size=[368, 496],
num_frames=16,
audio_samples_per_frame=1920,
samples_per_patch=16,
output_shape=[1, 16, 224, 224],
**kwargs
):
super().__init__(**kwargs)
self.num_latents = num_latents
self.d_latents = d_latents
self.d_model = d_model
self.num_blocks = num_blocks
self.num_self_attends_per_block = num_self_attends_per_block
self.num_self_attention_heads = num_self_attention_heads
self.num_cross_attention_heads = num_cross_attention_heads
self.qk_channels = qk_channels
self.v_channels = v_channels
self.cross_attention_shape_for_attention = cross_attention_shape_for_attention
self.self_attention_widening_factor = self_attention_widening_factor
self.cross_attention_widening_factor = cross_attention_widening_factor
self.hidden_act = hidden_act
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.initializer_range = initializer_range
self.layer_norm_eps = layer_norm_eps
self.use_query_residual = use_query_residual
# masked language modeling attributes
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
# image classification attributes
self.image_size = image_size
# flow attributes
self.train_size = train_size
# multimodal autoencoding attributes
self.num_frames = num_frames
self.audio_samples_per_frame = audio_samples_per_frame
self.samples_per_patch = samples_per_patch
self.output_shape = output_shape
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert Perceiver checkpoints originally implemented in Haiku."""
import argparse
import json
import pickle
from pathlib import Path
import numpy as np
import torch
from PIL import Image
import haiku as hk
import requests
from huggingface_hub import cached_download, hf_hub_url
from transformers import (
PerceiverConfig,
PerceiverFeatureExtractor,
PerceiverForImageClassificationConvProcessing,
PerceiverForImageClassificationFourier,
PerceiverForImageClassificationLearned,
PerceiverForMaskedLM,
PerceiverForMultimodalAutoencoding,
PerceiverForOpticalFlow,
PerceiverTokenizer,
)
from transformers.utils import logging
logging.set_verbosity_info()
logger = logging.get_logger(__name__)
def prepare_img():
# We will verify our results on an image of a dog
url = "https://storage.googleapis.com/perceiver_io/dalmation.jpg"
im = Image.open(requests.get(url, stream=True).raw)
return im
def rename_keys(state_dict, architecture):
for name in list(state_dict):
param = state_dict.pop(name)
# PREPROCESSORS
# rename text preprocessor embeddings (for MLM model)
name = name.replace("embed/embeddings", "input_preprocessor.embeddings.weight")
if name.startswith("trainable_position_encoding/pos_embs"):
name = name.replace(
"trainable_position_encoding/pos_embs", "input_preprocessor.position_embeddings.weight"
)
# rename image preprocessor embeddings (for image classification model with learned position embeddings)
name = name.replace("image_preprocessor/~/conv2_d/w", "input_preprocessor.convnet_1x1.weight")
name = name.replace("image_preprocessor/~/conv2_d/b", "input_preprocessor.convnet_1x1.bias")
name = name.replace(
"image_preprocessor/~_build_network_inputs/trainable_position_encoding/pos_embs",
"input_preprocessor.position_embeddings.position_embeddings",
)
name = name.replace(
"image_preprocessor/~_build_network_inputs/position_encoding_projector/linear/w",
"input_preprocessor.positions_projection.weight",
)
name = name.replace(
"image_preprocessor/~_build_network_inputs/position_encoding_projector/linear/b",
"input_preprocessor.positions_projection.bias",
)
# rename image preprocessor embeddings (for image classification model with conv processing)
if "counter" in name or "hidden" in name:
continue
name = name.replace(
"image_preprocessor/~/conv2_d_downsample/~/conv/w", "input_preprocessor.convnet.conv.weight"
)
name = name.replace(
"image_preprocessor/~/conv2_d_downsample/~/batchnorm/offset", "input_preprocessor.convnet.batchnorm.bias"
)
name = name.replace(
"image_preprocessor/~/conv2_d_downsample/~/batchnorm/scale", "input_preprocessor.convnet.batchnorm.weight"
)
name = name.replace(
"image_preprocessor/~/conv2_d_downsample/~/batchnorm/~/mean_ema/average",
"input_preprocessor.convnet.batchnorm.running_mean",
)
name = name.replace(
"image_preprocessor/~/conv2_d_downsample/~/batchnorm/~/var_ema/average",
"input_preprocessor.convnet.batchnorm.running_var",
)
# rename image preprocessor embeddings (for optical flow model)
name = name.replace("image_preprocessor/patches_linear/b", "input_preprocessor.conv_after_patches.bias")
name = name.replace("image_preprocessor/patches_linear/w", "input_preprocessor.conv_after_patches.weight")
# rename multimodal preprocessor embeddings
name = name.replace("multimodal_preprocessor/audio_mask_token/pos_embs", "input_preprocessor.mask.audio")
name = name.replace("multimodal_preprocessor/audio_padding/pos_embs", "input_preprocessor.padding.audio")
name = name.replace("multimodal_preprocessor/image_mask_token/pos_embs", "input_preprocessor.mask.image")
name = name.replace("multimodal_preprocessor/image_padding/pos_embs", "input_preprocessor.padding.image")
name = name.replace("multimodal_preprocessor/label_mask_token/pos_embs", "input_preprocessor.mask.label")
name = name.replace("multimodal_preprocessor/label_padding/pos_embs", "input_preprocessor.padding.label")
# DECODERS
# rename prefix of decoders
# multimodal autoencoding model
name = name.replace(
"multimodal_decoder/~/basic_decoder/cross_attention/", "decoder.decoder.decoding_cross_attention."
)
name = name.replace("multimodal_decoder/~decoder_query/audio_padding/pos_embs", "decoder.padding.audio")
name = name.replace("multimodal_decoder/~decoder_query/image_padding/pos_embs", "decoder.padding.image")
name = name.replace("multimodal_decoder/~decoder_query/label_padding/pos_embs", "decoder.padding.label")
name = name.replace("multimodal_decoder/~/basic_decoder/output/b", "decoder.decoder.final_layer.bias")
name = name.replace("multimodal_decoder/~/basic_decoder/output/w", "decoder.decoder.final_layer.weight")
if architecture == "multimodal_autoencoding":
name = name.replace(
"classification_decoder/~/basic_decoder/~/trainable_position_encoding/pos_embs",
"decoder.modalities.label.decoder.output_position_encodings.position_embeddings",
)
# flow model
name = name.replace(
"flow_decoder/~/basic_decoder/cross_attention/", "decoder.decoder.decoding_cross_attention."
)
name = name.replace("flow_decoder/~/basic_decoder/output/w", "decoder.decoder.final_layer.weight")
name = name.replace("flow_decoder/~/basic_decoder/output/b", "decoder.decoder.final_layer.bias")
# image models
name = name.replace(
"classification_decoder/~/basic_decoder/~/trainable_position_encoding/pos_embs",
"decoder.decoder.output_position_encodings.position_embeddings",
)
name = name.replace(
"basic_decoder/~/trainable_position_encoding/pos_embs",
"decoder.output_position_encodings.position_embeddings",
)
name = name.replace(
"classification_decoder/~/basic_decoder/cross_attention/", "decoder.decoder.decoding_cross_attention."
)
name = name.replace("classification_decoder/~/basic_decoder/output/b", "decoder.decoder.final_layer.bias")
name = name.replace("classification_decoder/~/basic_decoder/output/w", "decoder.decoder.final_layer.weight")
name = name.replace("classification_decoder/~/basic_decoder/~/", "decoder.decoder.")
name = name.replace("basic_decoder/cross_attention/", "decoder.decoding_cross_attention.")
name = name.replace("basic_decoder/~/", "decoder.")
# POSTPROCESSORS
name = name.replace(
"projection_postprocessor/linear/b", "output_postprocessor.modalities.image.classifier.bias"
)
name = name.replace(
"projection_postprocessor/linear/w", "output_postprocessor.modalities.image.classifier.weight"
)
name = name.replace(
"classification_postprocessor/linear/b", "output_postprocessor.modalities.label.classifier.bias"
)
name = name.replace(
"classification_postprocessor/linear/w", "output_postprocessor.modalities.label.classifier.weight"
)
name = name.replace("audio_postprocessor/linear/b", "output_postprocessor.modalities.audio.classifier.bias")
name = name.replace("audio_postprocessor/linear/w", "output_postprocessor.modalities.audio.classifier.weight")
# PERCEIVER MODEL
# rename latent embeddings
name = name.replace("perceiver_encoder/~/trainable_position_encoding/pos_embs", "embeddings.latents")
# rename latent embeddings (for multimodal model)
name = name.replace("encoder/~/trainable_position_encoding/pos_embs", "embeddings.latents")
# rename prefixes
if name.startswith("perceiver_encoder/~/"):
if "self_attention" in name:
suffix = "self_attends."
else:
suffix = ""
name = name.replace("perceiver_encoder/~/", "encoder." + suffix)
if name.startswith("encoder/~/"):
if "self_attention" in name:
suffix = "self_attends."
else:
suffix = ""
name = name.replace("encoder/~/", "encoder." + suffix)
# rename layernorm parameters
if "offset" in name:
name = name.replace("offset", "bias")
if "scale" in name:
name = name.replace("scale", "weight")
# in HuggingFace, the layernorm in between attention + MLP is just called "layernorm"
# rename layernorm in between attention + MLP of cross-attention
if "cross_attention" in name and "layer_norm_2" in name:
name = name.replace("layer_norm_2", "layernorm")
# rename layernorm in between attention + MLP of self-attention
if "self_attention" in name and "layer_norm_1" in name:
name = name.replace("layer_norm_1", "layernorm")
# in HuggingFace, the layernorms for queries + keys are called "layernorm1" and "layernorm2"
if "cross_attention" in name and "layer_norm_1" in name:
name = name.replace("layer_norm_1", "attention.self.layernorm2")
if "cross_attention" in name and "layer_norm" in name:
name = name.replace("layer_norm", "attention.self.layernorm1")
if "self_attention" in name and "layer_norm" in name:
name = name.replace("layer_norm", "attention.self.layernorm1")
# rename special characters by dots
name = name.replace("-", ".")
name = name.replace("/", ".")
# rename keys, queries, values and output of attention layers
if ("cross_attention" in name or "self_attention" in name) and "mlp" not in name:
if "linear.b" in name:
name = name.replace("linear.b", "self.query.bias")
if "linear.w" in name:
name = name.replace("linear.w", "self.query.weight")
if "linear_1.b" in name:
name = name.replace("linear_1.b", "self.key.bias")
if "linear_1.w" in name:
name = name.replace("linear_1.w", "self.key.weight")
if "linear_2.b" in name:
name = name.replace("linear_2.b", "self.value.bias")
if "linear_2.w" in name:
name = name.replace("linear_2.w", "self.value.weight")
if "linear_3.b" in name:
name = name.replace("linear_3.b", "output.dense.bias")
if "linear_3.w" in name:
name = name.replace("linear_3.w", "output.dense.weight")
if "self_attention_" in name:
name = name.replace("self_attention_", "")
if "self_attention" in name:
name = name.replace("self_attention", "0")
# rename dense layers of 2-layer MLP
if "mlp" in name:
if "linear.b" in name:
name = name.replace("linear.b", "dense1.bias")
if "linear.w" in name:
name = name.replace("linear.w", "dense1.weight")
if "linear_1.b" in name:
name = name.replace("linear_1.b", "dense2.bias")
if "linear_1.w" in name:
name = name.replace("linear_1.w", "dense2.weight")
        # finally, TRANSPOSE if kernel and not embedding layer, and set value
        # (Haiku stores linear kernels as (in_features, out_features), while PyTorch's nn.Linear expects (out_features, in_features))
if name[-6:] == "weight" and "embeddings" not in name:
param = np.transpose(param)
# if batchnorm, we need to squeeze it
if "batchnorm" in name:
param = np.squeeze(param)
if "embedding_decoder" not in name:
state_dict["perceiver." + name] = torch.from_numpy(param)
else:
state_dict[name] = torch.from_numpy(param)
@torch.no_grad()
def convert_perceiver_checkpoint(pickle_file, pytorch_dump_folder_path, architecture="MLM"):
"""
Copy/paste/tweak model's weights to our Perceiver structure.
"""
# load parameters as FlatMapping data structure
with open(pickle_file, "rb") as f:
checkpoint = pickle.loads(f.read())
state = None
if isinstance(checkpoint, dict) and architecture in [
"image_classification",
"image_classification_fourier",
"image_classification_conv",
]:
        # the image_classification_conv checkpoint also has batchnorm state (running_mean and running_var)
params = checkpoint["params"]
state = checkpoint["state"]
else:
params = checkpoint
# turn into initial state dict
state_dict = dict()
for scope_name, parameters in hk.data_structures.to_mutable_dict(params).items():
for param_name, param in parameters.items():
state_dict[scope_name + "/" + param_name] = param
if state is not None:
# add state variables
for scope_name, parameters in hk.data_structures.to_mutable_dict(state).items():
for param_name, param in parameters.items():
state_dict[scope_name + "/" + param_name] = param
# rename keys
rename_keys(state_dict, architecture=architecture)
# load HuggingFace model
config = PerceiverConfig()
subsampling = None
repo_id = "datasets/huggingface/label-files"
if architecture == "MLM":
config.qk_channels = 8 * 32
config.v_channels = 1280
model = PerceiverForMaskedLM(config)
elif "image_classification" in architecture:
config.num_latents = 512
config.d_latents = 1024
config.d_model = 512
config.num_blocks = 8
config.num_self_attends_per_block = 6
config.num_cross_attention_heads = 1
config.num_self_attention_heads = 8
config.qk_channels = None
config.v_channels = None
# set labels
config.num_labels = 1000
filename = "imagenet-1k-id2label.json"
id2label = json.load(open(cached_download(hf_hub_url(repo_id, filename)), "r"))
id2label = {int(k): v for k, v in id2label.items()}
config.id2label = id2label
config.label2id = {v: k for k, v in id2label.items()}
if architecture == "image_classification":
config.image_size = 224
model = PerceiverForImageClassificationLearned(config)
elif architecture == "image_classification_fourier":
config.d_model = 261
model = PerceiverForImageClassificationFourier(config)
elif architecture == "image_classification_conv":
config.d_model = 322
model = PerceiverForImageClassificationConvProcessing(config)
else:
raise ValueError(f"Architecture {architecture} not supported")
elif architecture == "optical_flow":
config.num_latents = 2048
config.d_latents = 512
config.d_model = 322
config.num_blocks = 1
config.num_self_attends_per_block = 24
config.num_self_attention_heads = 16
config.num_cross_attention_heads = 1
model = PerceiverForOpticalFlow(config)
elif architecture == "multimodal_autoencoding":
config.num_latents = 28 * 28 * 1
config.d_latents = 512
config.d_model = 704
config.num_blocks = 1
config.num_self_attends_per_block = 8
config.num_self_attention_heads = 8
config.num_cross_attention_heads = 1
config.num_labels = 700
# define dummy inputs + subsampling (as each forward pass is only on a chunk of image + audio data)
images = torch.randn((1, 16, 3, 224, 224))
audio = torch.randn((1, 30720, 1))
nchunks = 128
image_chunk_size = np.prod((16, 224, 224)) // nchunks
audio_chunk_size = audio.shape[1] // config.samples_per_patch // nchunks
# process the first chunk
chunk_idx = 0
subsampling = {
"image": torch.arange(image_chunk_size * chunk_idx, image_chunk_size * (chunk_idx + 1)),
"audio": torch.arange(audio_chunk_size * chunk_idx, audio_chunk_size * (chunk_idx + 1)),
"label": None,
}
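        # Sanity check on the chunking above: with these dummy shapes, each of the 128 chunks covers
        # 16 * 224 * 224 / 128 = 6272 image positions and 30720 / config.samples_per_patch / 128 audio patches.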
model = PerceiverForMultimodalAutoencoding(config)
# set labels
filename = "kinetics700-id2label.json"
id2label = json.load(open(cached_download(hf_hub_url(repo_id, filename)), "r"))
id2label = {int(k): v for k, v in id2label.items()}
config.id2label = id2label
config.label2id = {v: k for k, v in id2label.items()}
else:
raise ValueError(f"Architecture {architecture} not supported")
model.eval()
# load weights
model.load_state_dict(state_dict)
# prepare dummy input
input_mask = None
if architecture == "MLM":
tokenizer = PerceiverTokenizer.from_pretrained("/Users/NielsRogge/Documents/Perceiver/Tokenizer files")
text = "This is an incomplete sentence where some words are missing."
encoding = tokenizer(text, padding="max_length", return_tensors="pt")
# mask " missing.". Note that the model performs much better if the masked chunk starts with a space.
encoding.input_ids[0, 51:60] = tokenizer.mask_token_id
inputs = encoding.input_ids
input_mask = encoding.attention_mask
elif architecture in ["image_classification", "image_classification_fourier", "image_classification_conv"]:
feature_extractor = PerceiverFeatureExtractor()
image = prepare_img()
encoding = feature_extractor(image, return_tensors="pt")
inputs = encoding.pixel_values
elif architecture == "optical_flow":
inputs = torch.randn(1, 2, 27, 368, 496)
elif architecture == "multimodal_autoencoding":
images = torch.randn((1, 16, 3, 224, 224))
audio = torch.randn((1, 30720, 1))
inputs = dict(image=images, audio=audio, label=torch.zeros((images.shape[0], 700)))
# forward pass
if architecture == "multimodal_autoencoding":
outputs = model(inputs=inputs, attention_mask=input_mask, subsampled_output_points=subsampling)
else:
outputs = model(inputs=inputs, attention_mask=input_mask)
logits = outputs.logits
# verify logits
if not isinstance(logits, dict):
print("Shape of logits:", logits.shape)
else:
for k, v in logits.items():
print(f"Shape of logits of modality {k}", v.shape)
if architecture == "MLM":
expected_slice = torch.tensor(
[[-11.8336, -11.6850, -11.8483], [-12.8149, -12.5863, -12.7904], [-12.8440, -12.6410, -12.8646]]
)
assert torch.allclose(logits[0, :3, :3], expected_slice)
masked_tokens_predictions = logits[0, 51:60].argmax(dim=-1).tolist()
expected_list = [38, 115, 111, 121, 121, 111, 116, 109, 52]
assert masked_tokens_predictions == expected_list
print("Greedy predictions:")
print(masked_tokens_predictions)
print()
print("Predicted string:")
print(tokenizer.decode(masked_tokens_predictions))
elif architecture in ["image_classification", "image_classification_fourier", "image_classification_conv"]:
print("Predicted class:", model.config.id2label[logits.argmax(-1).item()])
# Finally, save files
Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
print(f"Saving model to {pytorch_dump_folder_path}")
model.save_pretrained(pytorch_dump_folder_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--pickle_file",
type=str,
default=None,
required=True,
help="Path to local pickle file of a Perceiver checkpoint you'd like to convert.",
)
parser.add_argument(
"--pytorch_dump_folder_path",
default=None,
type=str,
required=True,
help="Path to the output PyTorch model directory, provided as a string.",
)
parser.add_argument(
"--architecture",
default="MLM",
type=str,
help="""
Architecture, provided as a string. One of 'MLM', 'image_classification', image_classification_fourier',
image_classification_fourier', 'optical_flow' or 'multimodal_autoencoding'.
""",
)
args = parser.parse_args()
convert_perceiver_checkpoint(args.pickle_file, args.pytorch_dump_folder_path, args.architecture)
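# Example invocation (a sketch with hypothetical paths; assumes this script is saved as
# convert_perceiver_haiku_to_pytorch.py and that the pickle file is a Haiku checkpoint obtained
# separately from the original DeepMind Perceiver IO release):
#
#   python convert_perceiver_haiku_to_pytorch.py \
#       --pickle_file /path/to/haiku_checkpoint.pickle \
#       --pytorch_dump_folder_path ./perceiver-mlm \
#       --architecture MLM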
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Feature extractor class for Perceiver."""
from typing import Optional, Union
import numpy as np
from PIL import Image
from ...feature_extraction_utils import BatchFeature, FeatureExtractionMixin
from ...file_utils import TensorType
from ...image_utils import (
IMAGENET_DEFAULT_MEAN,
IMAGENET_DEFAULT_STD,
ImageFeatureExtractionMixin,
ImageInput,
is_torch_tensor,
)
from ...utils import logging
logger = logging.get_logger(__name__)
class PerceiverFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
r"""
Constructs a Perceiver feature extractor.
This feature extractor inherits from :class:`~transformers.ImageFeatureExtractionMixin` which contains most of the
main methods. Users should refer to this superclass for more information regarding those methods.
Args:
do_center_crop (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to crop the input at the center. If the input size is smaller than :obj:`crop_size` along any edge,
the image is padded with 0's and then center cropped.
crop_size (:obj:`int`, `optional`, defaults to 256):
Desired output size when applying center-cropping. Only has an effect if :obj:`do_center_crop` is set to
:obj:`True`.
do_resize (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to resize the input to a certain :obj:`size`.
size (:obj:`int` or :obj:`Tuple(int)`, `optional`, defaults to 224):
Resize the input to the given size. If a tuple is provided, it should be (width, height). If only an
integer is provided, then the input will be resized to (size, size). Only has an effect if :obj:`do_resize`
is set to :obj:`True`.
resample (:obj:`int`, `optional`, defaults to :obj:`PIL.Image.BICUBIC`):
An optional resampling filter. This can be one of :obj:`PIL.Image.NEAREST`, :obj:`PIL.Image.BOX`,
:obj:`PIL.Image.BILINEAR`, :obj:`PIL.Image.HAMMING`, :obj:`PIL.Image.BICUBIC` or :obj:`PIL.Image.LANCZOS`.
Only has an effect if :obj:`do_resize` is set to :obj:`True`.
do_normalize (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to normalize the input with :obj:`image_mean` and :obj:`image_std`.
        image_mean (:obj:`List[float]`, `optional`, defaults to :obj:`[0.485, 0.456, 0.406]`):
            The sequence of means for each channel, to be used when normalizing images.
        image_std (:obj:`List[float]`, `optional`, defaults to :obj:`[0.229, 0.224, 0.225]`):
            The sequence of standard deviations for each channel, to be used when normalizing images.
"""
model_input_names = ["pixel_values"]
def __init__(
self,
do_center_crop=True,
crop_size=256,
do_resize=True,
size=224,
resample=Image.BICUBIC,
do_normalize=True,
image_mean=None,
image_std=None,
**kwargs
):
super().__init__(**kwargs)
self.do_center_crop = do_center_crop
self.crop_size = crop_size
self.do_resize = do_resize
self.size = size
self.resample = resample
self.do_normalize = do_normalize
self.image_mean = image_mean if image_mean is not None else IMAGENET_DEFAULT_MEAN
self.image_std = image_std if image_std is not None else IMAGENET_DEFAULT_STD
def center_crop(self, image):
"""
        Crops :obj:`image` to :obj:`self.crop_size` using a center crop. Note that if the image is too small to be
        cropped to the given size, it will be padded so that the returned result has the requested size.
        Args:
            image (:obj:`PIL.Image.Image` or :obj:`np.ndarray` or :obj:`torch.Tensor`):
                The image to center crop.
"""
if isinstance(image, Image.Image):
image = self.to_numpy_array(image)
image_height, image_width = image.shape[-2:]
padded_center_crop_size = (
(self.size / (self.crop_size)) * np.minimum(image_height, image_width).astype(np.float32)
).astype(np.int32)
offset_height = ((image_height - padded_center_crop_size) + 1) // 2
offset_width = ((image_width - padded_center_crop_size) + 1) // 2
crop_window = [offset_height, offset_width, padded_center_crop_size, padded_center_crop_size]
image = image[
:, crop_window[0] : crop_window[0] + crop_window[2], crop_window[1] : crop_window[1] + crop_window[3]
]
return image
def __call__(
self, images: ImageInput, return_tensors: Optional[Union[str, TensorType]] = None, **kwargs
) -> BatchFeature:
"""
Main method to prepare for the model one or several image(s).
.. warning::
            NumPy arrays and PyTorch tensors are converted to PIL images when resizing, so it is most efficient to
            pass PIL images.
Args:
images (:obj:`PIL.Image.Image`, :obj:`np.ndarray`, :obj:`torch.Tensor`, :obj:`List[PIL.Image.Image]`, :obj:`List[np.ndarray]`, :obj:`List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
number of channels, H and W are image height and width.
return_tensors (:obj:`str` or :class:`~transformers.file_utils.TensorType`, `optional`, defaults to :obj:`'np'`):
If set, will return tensors of a particular framework. Acceptable values are:
* :obj:`'tf'`: Return TensorFlow :obj:`tf.constant` objects.
* :obj:`'pt'`: Return PyTorch :obj:`torch.Tensor` objects.
* :obj:`'np'`: Return NumPy :obj:`np.ndarray` objects.
* :obj:`'jax'`: Return JAX :obj:`jnp.ndarray` objects.
Returns:
:class:`~transformers.BatchFeature`: A :class:`~transformers.BatchFeature` with the following fields:
- **pixel_values** -- Pixel values to be fed to a model, of shape (batch_size, num_channels, height,
width).
"""
# Input type checking for clearer error
valid_images = False
# Check that images has a valid type
if isinstance(images, (Image.Image, np.ndarray)) or is_torch_tensor(images):
valid_images = True
elif isinstance(images, (list, tuple)):
if len(images) == 0 or isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]):
valid_images = True
if not valid_images:
raise ValueError(
"Images must of type `PIL.Image.Image`, `np.ndarray` or `torch.Tensor` (single example),"
"`List[PIL.Image.Image]`, `List[np.ndarray]` or `List[torch.Tensor]` (batch of examples)."
)
is_batched = bool(
isinstance(images, (list, tuple))
and (isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]))
)
if not is_batched:
images = [images]
# transformations (center cropping + resizing + normalization)
if self.do_center_crop and self.crop_size is not None:
images = [self.center_crop(image) for image in images]
if self.do_resize and self.size is not None and self.resample is not None:
images = [self.resize(image=image, size=self.size, resample=self.resample) for image in images]
if self.do_normalize:
images = [self.normalize(image=image, mean=self.image_mean, std=self.image_std) for image in images]
# return as BatchFeature
data = {"pixel_values": images}
encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
return encoded_inputs
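# Minimal usage sketch of the class above (illustrative only; "cat.png" stands in for any local image file):
#
#   from PIL import Image
#   feature_extractor = PerceiverFeatureExtractor()  # center crop to 256, resize to 224, ImageNet normalization
#   encoding = feature_extractor(Image.open("cat.png"), return_tensors="pt")
#   encoding.pixel_values.shape  # torch.Size([1, 3, 224, 224])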
# coding=utf-8
# Copyright 2021 Deepmind and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch Perceiver model. """
import abc
import math
from dataclasses import dataclass
from functools import reduce
from operator import __add__
from typing import Any, Callable, Mapping, Optional, Tuple
import numpy as np
import torch
import torch.utils.checkpoint
from torch import nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from ...activations import ACT2FN
from ...file_utils import (
ModelOutput,
add_code_sample_docstrings,
add_start_docstrings,
add_start_docstrings_to_model_forward,
replace_return_docstrings,
)
from ...modeling_outputs import BaseModelOutputWithCrossAttentions
from ...modeling_utils import (
PreTrainedModel,
apply_chunking_to_forward,
find_pruneable_heads_and_indices,
prune_linear_layer,
)
from ...utils import logging
from .configuration_perceiver import PerceiverConfig
ModalitySizeType = Mapping[str, int]
PreprocessorOutputType = Tuple[torch.Tensor, Optional[torch.Tensor], torch.Tensor]
PreprocessorType = Callable[..., PreprocessorOutputType]
PostprocessorType = Callable[..., Any]
logger = logging.get_logger(__name__)
_CHECKPOINT_FOR_DOC = "deepmind/language-perceiver"
_CONFIG_FOR_DOC = "PerceiverConfig"
_TOKENIZER_FOR_DOC = "PerceiverTokenizer"
PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST = [
"deepmind/language-perceiver",
# See all Perceiver models at https://huggingface.co/models?filter=perceiver
]
@dataclass
class PerceiverModelOutput(ModelOutput):
"""
Base class for Perceiver base model's outputs, with potential hidden states, attentions and cross-attentions.
Args:
logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_labels)`):
Classification (or regression if config.num_labels==1) scores (before SoftMax).
last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of
each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the
weighted average in the self-attention heads.
cross_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the
attention softmax, used to compute the weighted average in the cross-attention heads.
"""
logits: torch.FloatTensor = None
last_hidden_state: torch.FloatTensor = None
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
attentions: Optional[Tuple[torch.FloatTensor]] = None
cross_attentions: Optional[Tuple[torch.FloatTensor]] = None
@dataclass
class PerceiverDecoderOutput(ModelOutput):
"""
Base class for Perceiver decoder outputs, with potential cross-attentions.
Args:
logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_labels)`):
Output of the basic decoder.
cross_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the
attention softmax, used to compute the weighted average in the cross-attention heads.
"""
logits: torch.FloatTensor = None
cross_attentions: Optional[Tuple[torch.FloatTensor]] = None
@dataclass
class PerceiverMaskedLMOutput(ModelOutput):
"""
Base class for Perceiver's masked language model outputs.
Args:
loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
Masked language modeling (MLM) loss.
logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of
each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, num_latents,
num_latents)`. Attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads.
cross_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the
attention softmax, used to compute the weighted average in the cross-attention heads.
"""
loss: Optional[torch.FloatTensor] = None
logits: torch.FloatTensor = None
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
attentions: Optional[Tuple[torch.FloatTensor]] = None
cross_attentions: Optional[Tuple[torch.FloatTensor]] = None
@dataclass
class PerceiverClassifierOutput(ModelOutput):
"""
Base class for Perceiver's outputs of sequence/image classification models, optical flow and multimodal
autoencoding.
Args:
loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
Classification (or regression if config.num_labels==1) loss.
logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):
Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of
each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the
weighted average in the self-attention heads.
cross_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the
attention softmax, used to compute the weighted average in the cross-attention heads.
"""
loss: Optional[torch.FloatTensor] = None
logits: torch.FloatTensor = None
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
attentions: Optional[Tuple[torch.FloatTensor]] = None
cross_attentions: Optional[Tuple[torch.FloatTensor]] = None
class PerceiverEmbeddings(nn.Module):
"""Construct the latent embeddings."""
def __init__(self, config):
super().__init__()
self.latents = nn.Parameter(torch.randn(config.num_latents, config.d_latents))
def forward(self, batch_size):
return self.latents.expand(batch_size, -1, -1) # Thanks, Phil Wang
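# Note: the latents are a single learned (num_latents, d_latents) parameter that forward() merely broadcasts
# over the batch via expand(), e.g. (256, 1280) -> (batch_size, 256, 1280) with the default configuration.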
class PerceiverSelfAttention(nn.Module):
"""Multi-headed {cross, self}-attention. Can be used both in the encoder as well as in the decoder."""
def __init__(
self,
config,
is_cross_attention=False,
qk_channels=None,
v_channels=None,
num_heads=1,
q_dim=None,
kv_dim=None,
):
super().__init__()
self.num_heads = num_heads
# Q and K must have the same number of channels.
# Default to preserving Q's input's shape.
if qk_channels is None:
qk_channels = q_dim
# V's num_channels determines the shape of the output of QKV-attention.
# Default to the same number of channels used in the key-query operation.
if v_channels is None:
v_channels = qk_channels
if qk_channels % num_heads != 0:
raise ValueError(f"qk_channels ({qk_channels}) must be divisible by num_heads ({num_heads}).")
if v_channels % num_heads != 0:
raise ValueError(f"v_channels ({v_channels}) must be divisible by num_heads ({num_heads}).")
self.qk_channels = qk_channels
self.v_channels = v_channels
self.qk_channels_per_head = self.qk_channels // num_heads
self.v_channels_per_head = self.v_channels // num_heads
# Layer normalization
self.layernorm1 = nn.LayerNorm(q_dim)
self.layernorm2 = nn.LayerNorm(kv_dim) if is_cross_attention else nn.Identity()
# Projection matrices
self.query = nn.Linear(q_dim, qk_channels)
self.key = nn.Linear(kv_dim, qk_channels)
self.value = nn.Linear(kv_dim, v_channels)
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
def transpose_for_scores(self, x, channels_per_head):
new_x_shape = x.size()[:-1] + (self.num_heads, channels_per_head)
x = x.view(*new_x_shape)
return x.permute(0, 2, 1, 3)
def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
inputs=None,
inputs_mask=None,
output_attentions=False,
):
hidden_states = self.layernorm1(hidden_states)
inputs = self.layernorm2(inputs)
        # Project queries, keys and values to a common feature dimension. If this is instantiated as a
        # cross-attention module, the keys and values come from the inputs; the attention mask needs to be such that
        # the inputs' non-relevant tokens are not attended to.
is_cross_attention = inputs is not None
queries = self.query(hidden_states)
if is_cross_attention:
keys = self.key(inputs)
values = self.value(inputs)
attention_mask = inputs_mask
else:
keys = self.key(hidden_states)
values = self.value(hidden_states)
# Reshape channels for multi-head attention.
# We reshape from (batch_size, time, channels) to (batch_size, num_heads, time, channels per head)
queries = self.transpose_for_scores(queries, self.qk_channels_per_head)
keys = self.transpose_for_scores(keys, self.qk_channels_per_head)
values = self.transpose_for_scores(values, self.v_channels_per_head)
# Take the dot product between the queries and keys to get the raw attention scores.
attention_scores = torch.matmul(queries, keys.transpose(-1, -2))
batch_size, num_heads, seq_len, q_head_dim = queries.shape
_, _, _, v_head_dim = values.shape
hiddens = self.num_heads * v_head_dim
attention_scores = attention_scores / math.sqrt(q_head_dim)
if attention_mask is not None:
# Apply the attention mask (precomputed for all layers in PerceiverModel forward() function)
attention_scores = attention_scores + attention_mask
# Normalize the attention scores to probabilities.
attention_probs = nn.Softmax(dim=-1)(attention_scores)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = self.dropout(attention_probs)
# Mask heads if we want to
if head_mask is not None:
attention_probs = attention_probs * head_mask
context_layer = torch.matmul(attention_probs, values)
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
new_context_layer_shape = context_layer.size()[:-2] + (hiddens,)
context_layer = context_layer.view(*new_context_layer_shape)
outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
return outputs
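# Shape sketch for a cross-attention call (symbolic; assumes qk_channels and v_channels default to q_dim):
#   latents (batch, num_latents, q_dim)  -> queries (batch, num_heads, num_latents, qk_channels // num_heads)
#   inputs  (batch, seq_len, kv_dim)     -> keys/values (batch, num_heads, seq_len, channels // num_heads)
#   scores  (batch, num_heads, num_latents, seq_len) -> returned context (batch, num_latents, v_channels)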
class PerceiverSelfOutput(nn.Module):
def __init__(self, config, input_channels, output_channels):
super().__init__()
self.dense = nn.Linear(input_channels, output_channels)
def forward(self, hidden_states):
hidden_states = self.dense(hidden_states)
return hidden_states
class PerceiverAttention(nn.Module):
"""Attention module, including a dense block."""
def __init__(
self,
config,
is_cross_attention=False,
qk_channels=None,
v_channels=None,
num_heads=1,
q_dim=None,
kv_dim=None,
use_query_residual=True,
):
super().__init__()
# MultiHead attention
if is_cross_attention and qk_channels is None:
if config.cross_attention_shape_for_attention == "q":
qk_channels = q_dim
elif config.cross_attention_shape_for_attention == "kv":
qk_channels = kv_dim
else:
raise ValueError(
f"Unknown value {config.cross_attention_shape_for_attention} for "
"cross_attention_shape_for_attention."
)
else:
if qk_channels is None:
qk_channels = q_dim
if v_channels is None:
v_channels = qk_channels
self.self = PerceiverSelfAttention(
config,
is_cross_attention=is_cross_attention,
qk_channels=qk_channels,
v_channels=v_channels,
num_heads=num_heads,
q_dim=q_dim,
kv_dim=kv_dim,
)
        # dense block
        if is_cross_attention:
            output_channels = q_dim
        else:
            output_channels = v_channels
self.output = PerceiverSelfOutput(config, input_channels=self.self.v_channels, output_channels=output_channels)
self.use_query_residual = use_query_residual
self.pruned_heads = set()
def prune_heads(self, heads):
if len(heads) == 0:
return
heads, index = find_pruneable_heads_and_indices(
heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads
)
# Prune linear layers
self.self.query = prune_linear_layer(self.self.query, index)
self.self.key = prune_linear_layer(self.self.key, index)
self.self.value = prune_linear_layer(self.self.value, index)
self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
# Update hyper params and store pruned heads
self.self.num_attention_heads = self.self.num_attention_heads - len(heads)
self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads
self.pruned_heads = self.pruned_heads.union(heads)
def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
inputs=None,
inputs_mask=None,
output_attentions=False,
):
self_outputs = self.self(
hidden_states,
attention_mask,
head_mask,
inputs,
inputs_mask,
output_attentions,
)
# Output projection
attention_output = self.output(self_outputs[0])
# Optionally include a residual to the original queries.
# Consider omitting the residual if the semantics of query and output
# are different, e.g. if queries are positions and outputs are pixels.
if self.use_query_residual:
attention_output = attention_output + hidden_states
outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them
return outputs
class PerceiverMLP(nn.Module):
"""A Transformer-style dense module to follow attention."""
def __init__(self, config, input_size, widening_factor):
super().__init__()
self.dense1 = nn.Linear(input_size, widening_factor * input_size)
if isinstance(config.hidden_act, str):
self.intermediate_act_fn = ACT2FN[config.hidden_act]
else:
self.intermediate_act_fn = config.hidden_act
        self.dense2 = nn.Linear(widening_factor * input_size, input_size)
def forward(self, hidden_states):
hidden_states = self.dense1(hidden_states)
hidden_states = self.intermediate_act_fn(hidden_states)
hidden_states = self.dense2(hidden_states)
return hidden_states
class PerceiverLayer(nn.Module):
def __init__(
self,
config,
is_cross_attention=False,
qk_channels=None,
v_channels=None,
num_heads=1,
q_dim=None,
kv_dim=None,
widening_factor=4,
use_query_residual=True,
):
super().__init__()
self.chunk_size_feed_forward = config.chunk_size_feed_forward
self.seq_len_dim = 1
self.attention = PerceiverAttention(
config,
is_cross_attention=is_cross_attention,
qk_channels=qk_channels,
v_channels=v_channels,
num_heads=num_heads,
q_dim=q_dim,
kv_dim=kv_dim,
use_query_residual=use_query_residual,
)
self.layernorm = nn.LayerNorm(q_dim)
self.mlp = PerceiverMLP(config, input_size=q_dim, widening_factor=widening_factor)
def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
inputs=None,
inputs_mask=None,
output_attentions=False,
):
attention_outputs = self.attention(
hidden_states,
attention_mask,
head_mask,
inputs,
inputs_mask,
output_attentions,
)
attention_output = attention_outputs[0]
outputs = attention_outputs[1:] # add attentions if we output attention weights
layer_output = apply_chunking_to_forward(
self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
)
layer_output = layer_output + attention_output # residual connection
outputs = (layer_output,) + outputs
return outputs
def feed_forward_chunk(self, attention_output):
layer_output = self.layernorm(attention_output)
layer_output = self.mlp(layer_output)
return layer_output
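# The layer follows a pre-norm design: hidden_states -> hidden_states + attention(layernorm(hidden_states)),
# then output -> output + mlp(layernorm(output)); the attention residual can be disabled for cross-attention
# via use_query_residual.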
class PerceiverEncoder(nn.Module):
"""The Perceiver Encoder: a scalable, fully attentional encoder."""
def __init__(self, config):
super().__init__()
self.config = config
# Check that we can use multihead-attention with these shapes.
        if config.d_latents % config.num_self_attention_heads != 0:
            raise ValueError(
                f"d_latents ({config.d_latents}) must be divisible by"
                f" num_self_attention_heads ({config.num_self_attention_heads})."
            )
        if config.d_latents % config.num_cross_attention_heads != 0:
            raise ValueError(
                f"d_latents ({config.d_latents}) must be divisible by"
                f" num_cross_attention_heads ({config.num_cross_attention_heads})."
            )
# Construct the cross attention layer.
self.cross_attention = PerceiverLayer(
config,
is_cross_attention=True,
qk_channels=config.qk_channels,
v_channels=config.v_channels,
num_heads=config.num_cross_attention_heads,
q_dim=config.d_latents,
kv_dim=config.d_model,
widening_factor=config.cross_attention_widening_factor,
use_query_residual=config.use_query_residual,
)
# Construct a single block of self-attention layers.
# We get deeper architectures by applying this block more than once.
self_attention_layers = []
for _ in range(config.num_self_attends_per_block):
layer = PerceiverLayer(
config,
is_cross_attention=False,
qk_channels=config.qk_channels,
v_channels=config.v_channels,
num_heads=config.num_self_attention_heads,
q_dim=config.d_latents,
kv_dim=config.d_latents,
widening_factor=config.self_attention_widening_factor,
)
self_attention_layers.append(layer)
self.self_attends = nn.ModuleList(self_attention_layers)
def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
inputs=None,
inputs_mask=None,
output_attentions=False,
output_hidden_states=False,
return_dict=True,
):
all_hidden_states = () if output_hidden_states else None
all_self_attentions = () if output_attentions else None
all_cross_attentions = () if output_attentions else None
# Apply the cross-attention between the latents (hidden_states) and inputs:
layer_outputs = self.cross_attention(
hidden_states,
attention_mask=attention_mask,
head_mask=None,
inputs=inputs,
inputs_mask=inputs_mask,
output_attentions=output_attentions,
)
hidden_states = layer_outputs[0]
if output_attentions:
all_cross_attentions = all_cross_attentions + (layer_outputs[1],)
# Apply the block of self-attention layers more than once:
for _ in range(self.config.num_blocks):
for i, layer_module in enumerate(self.self_attends):
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
layer_head_mask = head_mask[i] if head_mask is not None else None
layer_outputs = layer_module(
hidden_states,
attention_mask=attention_mask,
head_mask=layer_head_mask,
output_attentions=output_attentions,
)
hidden_states = layer_outputs[0]
if output_attentions:
all_self_attentions = all_self_attentions + (layer_outputs[1],)
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
if not return_dict:
return tuple(
v
for v in [hidden_states, all_hidden_states, all_self_attentions, all_cross_attentions]
if v is not None
)
return BaseModelOutputWithCrossAttentions(
last_hidden_state=hidden_states,
hidden_states=all_hidden_states,
attentions=all_self_attentions,
cross_attentions=all_cross_attentions,
)
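# Note on depth: the cross-attention layer is applied once, after which the same block of
# num_self_attends_per_block self-attention layers is reused num_blocks times (weights are shared across
# blocks), e.g. 8 blocks of 6 layers amount to 48 self-attention applications.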
class PerceiverPreTrainedModel(PreTrainedModel):
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = PerceiverConfig
base_model_prefix = "perceiver"
def _init_weights(self, module):
"""Initialize the weights"""
if isinstance(module, (nn.Linear, nn.Conv2d)):
# Slightly different from the TF version which uses truncated_normal for initialization
# cf https://github.com/pytorch/pytorch/pull/5617
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
elif hasattr(module, "latents"):
module.latents.data.normal_(mean=0.0, std=self.config.initializer_range)
elif hasattr(module, "position_embeddings") and isinstance(module, PerceiverTrainablePositionEncoding):
module.position_embeddings.data.normal_(mean=0.0, std=self.config.initializer_range)
elif isinstance(module, nn.ParameterDict):
for modality in module.keys():
module[modality].data.normal_(mean=0.0, std=self.config.initializer_range)
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
PERCEIVER_START_DOCSTRING = r"""
    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class. Use
    it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and
    behavior.
Parameters:
config (:class:`~transformers.PerceiverConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model
weights.
"""
PERCEIVER_MODEL_START_DOCSTRING = r"""
    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class. Use
    it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and
    behavior.
Parameters:
config (:class:`~transformers.PerceiverConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model
weights.
decoder (`DecoderType`, `optional`):
Optional decoder to use to decode the latent representation of the encoder. Examples include
`transformers.models.perceiver.modeling_perceiver.PerceiverBasicDecoder`,
`transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder`,
`transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder`.
input_preprocessor (`PreprocessorType`, `optional`):
Optional input preprocessor to use. Examples include
`transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor`,
`transformers.models.perceiver.modeling_perceiver.PerceiverAudioPreprocessor`,
`transformers.models.perceiver.modeling_perceiver.PerceiverTextPreprocessor`,
`transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor`.
output_postprocessor (`PostprocessorType`, `optional`):
Optional output postprocessor to use. Examples include
`transformers.models.perceiver.modeling_perceiver.PerceiverImagePostprocessor`,
`transformers.models.perceiver.modeling_perceiver.PerceiverAudioPostprocessor`,
`transformers.models.perceiver.modeling_perceiver.PerceiverClassificationPostprocessor`,
`transformers.models.perceiver.modeling_perceiver.PerceiverProjectionPostprocessor`,
`transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPostprocessor`.
Note that you can define your own decoders, preprocessors and/or postprocessors to fit your use-case.
"""
PERCEIVER_INPUTS_DOCSTRING = r"""
Args:
inputs (:obj:`torch.FloatTensor`):
            Inputs to the Perceiver. Can be anything: images, text, audio, video, etc.
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`):
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
output_attentions (:obj:`bool`, `optional`):
Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`):
Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
more detail.
return_dict (:obj:`bool`, `optional`):
Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
"""
@add_start_docstrings(
"""The Perceiver: a scalable, fully attentional architecture.""",
PERCEIVER_MODEL_START_DOCSTRING,
)
class PerceiverModel(PerceiverPreTrainedModel):
def __init__(
self,
config,
decoder=None,
input_preprocessor: PreprocessorType = None,
output_postprocessor: PostprocessorType = None,
):
super().__init__(config)
self.config = config
self.input_preprocessor = input_preprocessor
self.output_postprocessor = output_postprocessor
self.embeddings = PerceiverEmbeddings(config)
self.encoder = PerceiverEncoder(config)
self.decoder = decoder
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.embeddings.latents
def set_input_embeddings(self, value):
self.embeddings.latents = value
def _prune_heads(self, heads_to_prune):
"""
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
class PreTrainedModel
"""
for layer, heads in heads_to_prune.items():
self.encoder.layer[layer].attention.prune_heads(heads)
@add_start_docstrings_to_model_forward(PERCEIVER_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
@add_code_sample_docstrings(
processor_class=_TOKENIZER_FOR_DOC,
checkpoint=_CHECKPOINT_FOR_DOC,
output_type=PerceiverModelOutput,
config_class=_CONFIG_FOR_DOC,
)
def forward(
self,
inputs,
attention_mask=None,
subsampled_output_points=None,
head_mask=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
):
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
if self.input_preprocessor is not None:
inputs, modality_sizes, inputs_without_pos = self.input_preprocessor(inputs)
else:
modality_sizes = None
inputs_without_pos = None
if inputs.size()[-1] != self.config.d_model:
raise ValueError(
f"Last dimension of the inputs: {inputs.size()[-1]} doesn't correspond to config.d_model: {self.config.d_model}. "
"Please update config.d_model appropriately."
)
else:
input_shape = inputs.size()
batch_size, seq_length, _ = input_shape
device = inputs.device
# If no attention mask is provided, make them all ones
if attention_mask is None:
attention_mask = torch.ones(((batch_size, seq_length)), device=device)
# Make the attention mask broadcastable to [batch_size, num_heads, seq_length, seq_length]
extended_attention_mask = self.invert_attention_mask(attention_mask)
# Prepare head mask if needed
# 1.0 in head_mask indicate we keep the head
# attention_probs has shape bsz x n_heads x N x N
# input head_mask has shape [num_heads] or [num_blocks x num_heads]
# and head_mask is converted to shape [num_blocks x batch x num_heads x N x N]
head_mask = self.get_head_mask(head_mask, self.config.num_blocks * self.config.num_self_attends_per_block)
embedding_output = self.embeddings(batch_size=batch_size)
encoder_outputs = self.encoder(
embedding_output,
attention_mask=None,
head_mask=head_mask,
inputs=inputs,
inputs_mask=extended_attention_mask,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = encoder_outputs[0]
logits = None
if self.decoder:
if subsampled_output_points is not None:
output_modality_sizes = {
"audio": subsampled_output_points["audio"].shape[0],
"image": subsampled_output_points["image"].shape[0],
"label": 1,
}
else:
output_modality_sizes = None
decoder_query = self.decoder.decoder_query(
inputs, modality_sizes, inputs_without_pos, subsampled_points=subsampled_output_points
)
decoder_outputs = self.decoder(
decoder_query,
z=sequence_output,
query_mask=extended_attention_mask,
output_attentions=output_attentions,
)
logits = decoder_outputs.logits
# add cross-attentions of decoder
if output_attentions and decoder_outputs.cross_attentions is not None:
if return_dict:
encoder_outputs.cross_attentions = (
encoder_outputs.cross_attentions + decoder_outputs.cross_attentions
)
else:
encoder_outputs = encoder_outputs + decoder_outputs.cross_attentions
if self.output_postprocessor:
logits = self.output_postprocessor(logits, modality_sizes=output_modality_sizes)
if not return_dict:
if logits is not None:
return (logits, sequence_output) + encoder_outputs[1:]
else:
return (sequence_output,) + encoder_outputs[1:]
return PerceiverModelOutput(
logits=logits,
last_hidden_state=sequence_output,
hidden_states=encoder_outputs.hidden_states,
attentions=encoder_outputs.attentions,
cross_attentions=encoder_outputs.cross_attentions,
)
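# Minimal sketch of the bare PerceiverModel without pre/postprocessors (illustrative shapes only;
# the inputs must already have config.d_model channels in their last dimension):
#
#   import torch
#   from transformers import PerceiverConfig, PerceiverModel
#   config = PerceiverConfig(d_model=768)
#   model = PerceiverModel(config)
#   outputs = model(inputs=torch.randn(1, 2048, 768))  # (batch, seq_len, d_model)
#   outputs.last_hidden_state.shape                     # (1, config.num_latents, config.d_latents)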
@add_start_docstrings("""Example use of Perceiver for masked language modeling. """, PERCEIVER_START_DOCSTRING)
class PerceiverForMaskedLM(PerceiverPreTrainedModel):
def __init__(self, config):
super().__init__(config)
trainable_position_encoding_kwargs_decoder = dict(
num_channels=config.d_model, index_dims=config.max_position_embeddings
)
self.perceiver = PerceiverModel(
config,
input_preprocessor=PerceiverTextPreprocessor(config),
decoder=PerceiverBasicDecoder(
config,
output_num_channels=config.d_latents,
output_index_dims=config.max_position_embeddings, # we need to define the seq_len of the inputs beforehand
num_channels=config.d_model,
qk_channels=8 * 32,
v_channels=config.d_model,
num_heads=8,
use_query_residual=False,
final_project=False,
trainable_position_encoding_kwargs=trainable_position_encoding_kwargs_decoder,
),
)
self.embedding_decoder = PerceiverEmbeddingDecoder(config)
# Initialize weights and apply final processing
self.post_init()
@add_start_docstrings_to_model_forward(PERCEIVER_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings(
processor_class=_TOKENIZER_FOR_DOC,
checkpoint=_CHECKPOINT_FOR_DOC,
output_type=PerceiverMaskedLMOutput,
config_class=_CONFIG_FOR_DOC,
)
def forward(
self,
inputs=None,
attention_mask=None,
head_mask=None,
output_attentions=None,
output_hidden_states=None,
labels=None,
return_dict=None,
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ...,
config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored
(masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.perceiver(
inputs=inputs,
attention_mask=attention_mask,
head_mask=head_mask,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
logits = self.embedding_decoder(
outputs.logits if return_dict else outputs[0], embedding_layer=self.perceiver.input_preprocessor.embeddings
)
masked_lm_loss = None
if labels is not None:
loss_fct = CrossEntropyLoss() # -100 index = padding token
masked_lm_loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1))
if not return_dict:
output = (logits,) + outputs[2:]
return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
return PerceiverMaskedLMOutput(
loss=masked_lm_loss,
logits=logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
cross_attentions=outputs.cross_attentions,
)
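# Usage sketch for the MLM head (mirrors the sanity check in the conversion script; the hub checkpoint
# name is assumed to be "deepmind/language-perceiver"):
#
#   from transformers import PerceiverTokenizer, PerceiverForMaskedLM
#   tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
#   model = PerceiverForMaskedLM.from_pretrained("deepmind/language-perceiver")
#   text = "This is an incomplete sentence where some words are missing."
#   encoding = tokenizer(text, padding="max_length", return_tensors="pt")
#   encoding.input_ids[0, 51:60] = tokenizer.mask_token_id  # mask " missing."
#   logits = model(inputs=encoding.input_ids, attention_mask=encoding.attention_mask).logits
#   print(tokenizer.decode(logits[0, 51:60].argmax(dim=-1).tolist()))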
@add_start_docstrings("""Example use of Perceiver for text classification. """, PERCEIVER_START_DOCSTRING)
class PerceiverForSequenceClassification(PerceiverPreTrainedModel):
def __init__(self, config):
super().__init__(config)
trainable_position_encoding_kwargs_decoder = dict(num_channels=config.d_latents, index_dims=1)
self.num_labels = config.num_labels
self.perceiver = PerceiverModel(
config,
input_preprocessor=PerceiverTextPreprocessor(config),
decoder=PerceiverClassificationDecoder(
config,
num_channels=config.d_latents,
trainable_position_encoding_kwargs=trainable_position_encoding_kwargs_decoder,
use_query_residual=True,
),
)
# Initialize weights and apply final processing
self.post_init()
@add_start_docstrings_to_model_forward(PERCEIVER_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings(
processor_class=_TOKENIZER_FOR_DOC,
checkpoint=_CHECKPOINT_FOR_DOC,
output_type=PerceiverClassifierOutput,
config_class=_CONFIG_FOR_DOC,
)
def forward(
self,
inputs=None,
attention_mask=None,
head_mask=None,
output_attentions=None,
output_hidden_states=None,
labels=None,
return_dict=None,
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the classification/regression loss. Indices should be in :obj:`[0, ...,
config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
Returns:
Examples::
>>> from transformers import PerceiverTokenizer, PerceiverForSequenceClassification
>>> tokenizer = PerceiverTokenizer.from_pretrained('deepmind/language-perceiver')
>>> model = PerceiverForSequenceClassification.from_pretrained('deepmind/language-perceiver')
>>> text = "hello world"
            >>> inputs = tokenizer(text, return_tensors="pt").input_ids
>>> outputs = model(inputs=inputs)
>>> logits = outputs.logits
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.perceiver(
inputs=inputs,
attention_mask=attention_mask,
head_mask=head_mask,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
logits = outputs.logits if return_dict else outputs[0]
loss = None
if labels is not None:
if self.config.problem_type is None:
if self.num_labels == 1:
self.config.problem_type = "regression"
elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
self.config.problem_type = "single_label_classification"
else:
self.config.problem_type = "multi_label_classification"
if self.config.problem_type == "regression":
loss_fct = MSELoss()
if self.num_labels == 1:
loss = loss_fct(logits.squeeze(), labels.squeeze())
else:
loss = loss_fct(logits, labels)
elif self.config.problem_type == "single_label_classification":
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
elif self.config.problem_type == "multi_label_classification":
loss_fct = BCEWithLogitsLoss()
loss = loss_fct(logits, labels)
if not return_dict:
output = (logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return PerceiverClassifierOutput(
loss=loss,
logits=logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
cross_attentions=outputs.cross_attentions,
)
@add_start_docstrings(
"""
Example use of Perceiver for image classification, for tasks such as ImageNet.
This model uses learned position embeddings. In other words, this model is not given any privileged information about
the structure of images. As shown in the paper, this model can achieve a top-1 accuracy of 72.7 on ImageNet.
`PerceiverForImageClassificationLearned` uses
`transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with `prep_type` = "conv1x1") to
preprocess the input images, and `transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to
decode the latent representation of `~transformers.PerceiverModel` into classification logits.
""",
PERCEIVER_START_DOCSTRING,
)
class PerceiverForImageClassificationLearned(PerceiverPreTrainedModel):
def __init__(self, config):
super().__init__(config)
trainable_position_encoding_kwargs_preprocessor = dict(num_channels=256, index_dims=config.image_size ** 2)
trainable_position_encoding_kwargs_decoder = dict(num_channels=config.d_latents, index_dims=1)
self.num_labels = config.num_labels
self.perceiver = PerceiverModel(
config,
input_preprocessor=PerceiverImagePreprocessor(
config,
prep_type="conv1x1",
spatial_downsample=1,
out_channels=256,
position_encoding_type="trainable",
concat_or_add_pos="concat",
project_pos_dim=256,
trainable_position_encoding_kwargs=trainable_position_encoding_kwargs_preprocessor,
),
decoder=PerceiverClassificationDecoder(
config,
num_channels=config.d_latents,
trainable_position_encoding_kwargs=trainable_position_encoding_kwargs_decoder,
use_query_residual=True,
),
)
# Initialize weights and apply final processing
self.post_init()
@add_start_docstrings_to_model_forward(PERCEIVER_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@replace_return_docstrings(output_type=PerceiverClassifierOutput, config_class=_CONFIG_FOR_DOC)
def forward(
self,
inputs=None,
attention_mask=None,
head_mask=None,
output_attentions=None,
output_hidden_states=None,
labels=None,
return_dict=None,
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the image classification/regression loss. Indices should be in :obj:`[0, ...,
config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
Returns:
Examples::
>>> from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationLearned
>>> from PIL import Image
>>> import requests
>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> feature_extractor = PerceiverFeatureExtractor.from_pretrained('deepmind/vision-perceiver-learned')
>>> model = PerceiverForImageClassificationLearned.from_pretrained('deepmind/vision-perceiver-learned')
>>> inputs = feature_extractor(images=image, return_tensors="pt").pixel_values
>>> outputs = model(inputs=inputs)
>>> logits = outputs.logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_class_idx = logits.argmax(-1).item()
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.perceiver(
inputs=inputs,
attention_mask=attention_mask,
head_mask=head_mask,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
logits = outputs.logits if return_dict else outputs[0]
loss = None
if labels is not None:
if self.config.problem_type is None:
if self.num_labels == 1:
self.config.problem_type = "regression"
elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
self.config.problem_type = "single_label_classification"
else:
self.config.problem_type = "multi_label_classification"
if self.config.problem_type == "regression":
loss_fct = MSELoss()
if self.num_labels == 1:
loss = loss_fct(logits.squeeze(), labels.squeeze())
else:
loss = loss_fct(logits, labels)
elif self.config.problem_type == "single_label_classification":
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
elif self.config.problem_type == "multi_label_classification":
loss_fct = BCEWithLogitsLoss()
loss = loss_fct(logits, labels)
if not return_dict:
output = (logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return PerceiverClassifierOutput(
loss=loss,
logits=logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
cross_attentions=outputs.cross_attentions,
)
@add_start_docstrings(
"""
Example use of Perceiver for image classification, for tasks such as ImageNet.
This model uses fixed 2D Fourier position embeddings. As shown in the paper, this model can achieve a top-1 accuracy of
79.0 on ImageNet, and 84.5 when pre-trained on a large-scale dataset (i.e. JFT).
`PerceiverForImageClassificationFourier` uses
`transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with `prep_type` = "pixels") to
preprocess the input images, and `transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to
decode the latent representation of `~transformers.PerceiverModel` into classification logits.
""",
PERCEIVER_START_DOCSTRING,
)
class PerceiverForImageClassificationFourier(PerceiverPreTrainedModel):
def __init__(self, config):
super().__init__(config)
fourier_position_encoding_kwargs_preprocessor = dict(
concat_pos=True, max_resolution=(224, 224), num_bands=64, sine_only=False
)
trainable_position_encoding_kwargs_decoder = dict(num_channels=config.d_latents, index_dims=1)
self.num_labels = config.num_labels
self.perceiver = PerceiverModel(
config,
input_preprocessor=PerceiverImagePreprocessor(
config,
prep_type="pixels",
spatial_downsample=1,
fourier_position_encoding_kwargs=fourier_position_encoding_kwargs_preprocessor,
),
decoder=PerceiverClassificationDecoder(
config,
num_channels=config.d_latents,
trainable_position_encoding_kwargs=trainable_position_encoding_kwargs_decoder,
use_query_residual=True,
),
)
# Initialize weights and apply final processing
self.post_init()
@add_start_docstrings_to_model_forward(PERCEIVER_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@replace_return_docstrings(output_type=PerceiverClassifierOutput, config_class=_CONFIG_FOR_DOC)
def forward(
self,
inputs=None,
attention_mask=None,
head_mask=None,
output_attentions=None,
output_hidden_states=None,
labels=None,
return_dict=None,
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the image classification/regression loss. Indices should be in :obj:`[0, ...,
config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
Returns:
Examples::
>>> from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationFourier
>>> from PIL import Image
>>> import requests
>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> feature_extractor = PerceiverFeatureExtractor.from_pretrained('deepmind/vision-perceiver-fourier')
>>> model = PerceiverForImageClassificationFourier.from_pretrained('deepmind/vision-perceiver-fourier')
>>> inputs = feature_extractor(images=image, return_tensors="pt").pixel_values
>>> outputs = model(inputs=inputs)
>>> logits = outputs.logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_class_idx = logits.argmax(-1).item()
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.perceiver(
inputs=inputs,
attention_mask=attention_mask,
head_mask=head_mask,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
logits = outputs.logits if return_dict else outputs[0]
loss = None
if labels is not None:
if self.config.problem_type is None:
if self.num_labels == 1:
self.config.problem_type = "regression"
elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
self.config.problem_type = "single_label_classification"
else:
self.config.problem_type = "multi_label_classification"
if self.config.problem_type == "regression":
loss_fct = MSELoss()
if self.num_labels == 1:
loss = loss_fct(logits.squeeze(), labels.squeeze())
else:
loss = loss_fct(logits, labels)
elif self.config.problem_type == "single_label_classification":
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
elif self.config.problem_type == "multi_label_classification":
loss_fct = BCEWithLogitsLoss()
loss = loss_fct(logits, labels)
if not return_dict:
output = (logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return PerceiverClassifierOutput(
loss=loss,
logits=logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
cross_attentions=outputs.cross_attentions,
)
@add_start_docstrings(
"""
Example use of Perceiver for image classification, for tasks such as ImageNet.
This model uses a 2D conv+maxpool preprocessing network. As shown in the paper, this model can achieve a top-1 accuracy
of 82.1 on ImageNet.
`PerceiverForImageClassificationConvProcessing` uses
`transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with `prep_type` = "conv") to preprocess
the input images, and `transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to decode the
latent representation of `~transformers.PerceiverModel` into classification logits.
""",
PERCEIVER_START_DOCSTRING,
)
class PerceiverForImageClassificationConvProcessing(PerceiverPreTrainedModel):
def __init__(self, config):
super().__init__(config)
fourier_position_encoding_kwargs_preprocessor = dict(
concat_pos=True, max_resolution=(56, 56), num_bands=64, sine_only=False
)
trainable_position_encoding_kwargs_decoder = dict(num_channels=config.d_latents, index_dims=1)
self.num_labels = config.num_labels
self.perceiver = PerceiverModel(
config,
input_preprocessor=PerceiverImagePreprocessor(
config,
prep_type="conv",
spatial_downsample=1,
position_encoding_type="fourier",
fourier_position_encoding_kwargs=fourier_position_encoding_kwargs_preprocessor,
),
decoder=PerceiverClassificationDecoder(
config,
num_channels=config.d_latents,
trainable_position_encoding_kwargs=trainable_position_encoding_kwargs_decoder,
use_query_residual=True,
),
)
# Initialize weights and apply final processing
self.post_init()
@add_start_docstrings_to_model_forward(PERCEIVER_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@replace_return_docstrings(output_type=PerceiverClassifierOutput, config_class=_CONFIG_FOR_DOC)
def forward(
self,
inputs=None,
attention_mask=None,
head_mask=None,
output_attentions=None,
output_hidden_states=None,
labels=None,
return_dict=None,
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the image classification/regression loss. Indices should be in :obj:`[0, ...,
config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
Returns:
Examples::
>>> from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationConvProcessing
>>> from PIL import Image
>>> import requests
>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> feature_extractor = PerceiverFeatureExtractor.from_pretrained('deepmind/vision-perceiver-conv')
>>> model = PerceiverForImageClassificationConvProcessing.from_pretrained('deepmind/vision-perceiver-conv')
>>> inputs = feature_extractor(images=image, return_tensors="pt").pixel_values
>>> outputs = model(inputs=inputs)
>>> logits = outputs.logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_class_idx = logits.argmax(-1).item()
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.perceiver(
inputs=inputs,
attention_mask=attention_mask,
head_mask=head_mask,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
logits = outputs.logits if return_dict else outputs[0]
loss = None
if labels is not None:
if self.config.problem_type is None:
if self.num_labels == 1:
self.config.problem_type = "regression"
elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
self.config.problem_type = "single_label_classification"
else:
self.config.problem_type = "multi_label_classification"
if self.config.problem_type == "regression":
loss_fct = MSELoss()
if self.num_labels == 1:
loss = loss_fct(logits.squeeze(), labels.squeeze())
else:
loss = loss_fct(logits, labels)
elif self.config.problem_type == "single_label_classification":
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
elif self.config.problem_type == "multi_label_classification":
loss_fct = BCEWithLogitsLoss()
loss = loss_fct(logits, labels)
if not return_dict:
output = (logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return PerceiverClassifierOutput(
loss=loss,
logits=logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
cross_attentions=outputs.cross_attentions,
)
@add_start_docstrings(
"""
Example use of Perceiver for optical flow, for tasks such as Sintel and KITTI. `PerceiverForOpticalFlow` uses
`transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with `prep_type` = "patches") to
preprocess the input images, and `transformers.models.perceiver.modeling_perceiver.PerceiverOpticalFlowDecoder` to
decode the latent representation of `~transformers.PerceiverModel`.
As input, one concatenates 2 consecutive frames along the channel dimension and extracts a 3 x 3 patch around each
pixel (leading to 3 x 3 x 3 x 2 = 54 values for each pixel). Fixed Fourier position encodings are used to encode the position
of each pixel in the patch. Next, one applies the Perceiver encoder. To decode, one queries the latent representation
using the same encoding used for the input.
""",
PERCEIVER_START_DOCSTRING,
)
class PerceiverForOpticalFlow(PerceiverPreTrainedModel):
def __init__(self, config):
super().__init__(config)
fourier_position_encoding_kwargs_preprocessor = dict(
num_bands=64,
max_resolution=config.train_size,
sine_only=False,
concat_pos=True,
)
fourier_position_encoding_kwargs_decoder = dict(
concat_pos=True, max_resolution=config.train_size, num_bands=64, sine_only=False
)
self.perceiver = PerceiverModel(
config,
input_preprocessor=PerceiverImagePreprocessor(
config,
prep_type="patches",
spatial_downsample=1,
conv_after_patching=True,
conv_after_patching_in_channels=54,
temporal_downsample=2,
position_encoding_type="fourier",
# position_encoding_kwargs
fourier_position_encoding_kwargs=fourier_position_encoding_kwargs_preprocessor,
),
decoder=PerceiverOpticalFlowDecoder(
config,
num_channels=config.d_model,
output_image_shape=config.train_size,
rescale_factor=100.0,
# decoder kwargs
use_query_residual=False,
output_num_channels=2,
# We query the decoder using the first frame features
# rather than a standard decoder position encoding.
position_encoding_type="fourier",
fourier_position_encoding_kwargs=fourier_position_encoding_kwargs_decoder,
),
)
# Initialize weights and apply final processing
self.post_init()
@add_start_docstrings_to_model_forward(PERCEIVER_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@replace_return_docstrings(output_type=PerceiverClassifierOutput, config_class=_CONFIG_FOR_DOC)
def forward(
self,
inputs=None,
attention_mask=None,
head_mask=None,
output_attentions=None,
output_hidden_states=None,
labels=None,
return_dict=None,
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the optical flow loss. Indices should be in :obj:`[0, ..., config.num_labels - 1]`.
Returns:
Examples::
>>> from transformers import PerceiverForOpticalFlow
>>> import torch
>>> model = PerceiverForOpticalFlow.from_pretrained('deepmind/optical-flow-perceiver')
>>> # in the Perceiver IO paper, the authors extract a 3 x 3 patch around each pixel,
>>> # leading to 3 x 3 x 3 = 27 values for each pixel (as each pixel also has 3 color channels)
>>> # patches have shape (batch_size, num_frames, num_channels, height, width)
>>> # the authors train on resolutions of 368 x 496
>>> patches = torch.randn(1, 2, 27, 368, 496)
>>> outputs = model(inputs=patches)
>>> logits = outputs.logits
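>>> # Note: the random tensor above only illustrates the expected input shape. One way to build real
>>> # per-pixel 3 x 3 patches from two consecutive frames (an illustrative sketch, not necessarily the
>>> # exact patch extraction used for the released checkpoint) is torch.nn.functional.unfold:
>>> # frame1, frame2 = ...  # each of shape (3, 368, 496), channels first
>>> # patches = torch.stack(
>>> #     [
>>> #         torch.nn.functional.unfold(f[None], kernel_size=3, padding=1).reshape(1, 27, 368, 496)
>>> #         for f in (frame1, frame2)
>>> #     ],
>>> #     dim=1,
>>> # )  # shape (1, 2, 27, 368, 496)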
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.perceiver(
inputs=inputs,
attention_mask=attention_mask,
head_mask=head_mask,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
logits = outputs.logits if return_dict else outputs[0]
loss = None
if labels is not None:
raise NotImplementedError("Optical flow training is not yet supported")
if not return_dict:
output = (logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return PerceiverClassifierOutput(
loss=loss,
logits=logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
cross_attentions=outputs.cross_attentions,
)
@add_start_docstrings(
"""
Example use of Perceiver for multimodal (video) autoencoding, for tasks such as Kinetics-700.
`PerceiverForMultimodalAutoencoding` uses
`transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor` to preprocess the 3 modalities:
images, audio and class labels. This preprocessor uses modality-specific preprocessors to preprocess every modality
separately, after which they are concatenated. Trainable position embeddings are used to pad each modality to the same
number of channels to make concatenation along the time dimension possible. Next, one applies the Perceiver encoder.
`transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder` is used to decode the latent
representation of `~transformers.PerceiverModel`. This decoder uses each modality-specific decoder to construct
queries. The decoder queries are created based on the inputs after preprocessing. However, autoencoding an entire video
in a single forward pass is computationally infeasible, hence one only uses parts of the decoder queries to do
cross-attention with the latent representation. This is determined by the subsampled indices for each modality, which
can be provided as additional input to the forward pass of `PerceiverForMultimodalAutoencoding`.
`transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder` also pads the decoder queries of the
different modalities to the same number of channels, in order to concatenate them along the time dimension. Next,
cross-attention is performed with the latent representation of `PerceiverModel`.
Finally, `transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPostprocessor` is used to turn this
tensor into an actual video. It first splits up the output into the different modalities, and then applies the
respective postprocessor for each modality.
Note that, by masking the classification label during evaluation (i.e. simply providing a tensor of zeros for the
"label" modality), this auto-encoding model becomes a Kinetics 700 video classifier.
""",
PERCEIVER_START_DOCSTRING,
)
class PerceiverForMultimodalAutoencoding(PerceiverPreTrainedModel):
def __init__(self, config):
super().__init__(config)
n_audio_samples = config.num_frames * config.audio_samples_per_frame
input_preprocessor = PerceiverMultimodalPreprocessor(
min_padding_size=4,
modalities={
"audio": PerceiverAudioPreprocessor(
config,
position_encoding_type="fourier",
fourier_position_encoding_kwargs=dict(
num_bands=192,
max_resolution=(n_audio_samples,),
sine_only=False,
concat_pos=True,
),
prep_type="patches",
samples_per_patch=config.samples_per_patch,
),
"image": PerceiverImagePreprocessor(
config,
position_encoding_type="fourier",
fourier_position_encoding_kwargs=dict(
num_bands=32,
max_resolution=(config.num_frames, config.image_size, config.image_size),
sine_only=False,
concat_pos=True,
),
prep_type="patches",
spatial_downsample=4,
temporal_downsample=1,
),
"label": PerceiverOneHotPreprocessor(config),
},
mask_probs={"image": 0.0, "audio": 0.0, "label": 1.0},
)
image_decoder = PerceiverBasicVideoAutoencodingDecoder(
config,
# Autoencoding, don't pass inputs to the queries.
concat_preprocessed_input=False,
output_shape=config.output_shape,
output_num_channels=512,
use_query_residual=False,
position_encoding_only=True,
position_encoding_type="fourier",
fourier_position_encoding_kwargs=dict(
num_bands=32,
max_resolution=(config.num_frames, config.image_size, config.image_size),
sine_only=False,
concat_pos=True,
),
)
decoder = PerceiverMultimodalDecoder(
config,
# Autoencoding, don't pass inputs to the queries.
concat_preprocessed_input=False,
# Modality specific decoders are used ONLY to generate queries.
# All modalities are decoded together using a unified decoder.
modalities={
"audio": PerceiverBasicDecoder(
config,
# Autoencoding, don't pass inputs to the queries.
concat_preprocessed_input=False,
output_index_dims=(n_audio_samples // config.samples_per_patch,),
output_num_channels=512,
use_query_residual=False,
position_encoding_only=True,
position_encoding_type="fourier",
fourier_position_encoding_kwargs=dict(
num_bands=192,
max_resolution=(n_audio_samples,),
sine_only=False,
concat_pos=True,
),
),
"image": image_decoder,
"label": PerceiverClassificationDecoder(
config,
# Autoencoding, don't pass inputs to the queries.
concat_preprocessed_input=False,
use_query_residual=False,
position_encoding_only=True,
position_encoding_type="trainable",
trainable_position_encoding_kwargs=dict(
num_channels=1024,
index_dims=1,
),
),
},
num_outputs=None,
output_num_channels=512,
use_query_residual=False,
)
output_postprocessor = PerceiverMultimodalPostprocessor(
modalities={
"audio": PerceiverAudioPostprocessor(config, in_channels=512),
"image": PerceiverProjectionPostprocessor(in_channels=512, out_channels=3),
"label": PerceiverClassificationPostprocessor(config, in_channels=512),
}
)
self.perceiver = PerceiverModel(
config,
input_preprocessor=input_preprocessor,
decoder=decoder,
output_postprocessor=output_postprocessor,
)
# Initialize weights and apply final processing
self.post_init()
@add_start_docstrings_to_model_forward(PERCEIVER_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@replace_return_docstrings(output_type=PerceiverClassifierOutput, config_class=_CONFIG_FOR_DOC)
def forward(
self,
inputs=None,
attention_mask=None,
subsampled_output_points=None,
head_mask=None,
output_attentions=None,
output_hidden_states=None,
labels=None,
return_dict=None,
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the image classification/regression loss. Indices should be in :obj:`[0, ...,
config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
Returns:
Examples::
>>> from transformers import PerceiverForMultimodalAutoencoding
>>> import torch
>>> images = torch.randn((1, 16, 3, 224, 224))
>>> audio = torch.randn((1, 30720, 1))
>>> inputs = dict(image=images, audio=audio, label=torch.zeros((images.shape[0], 700)))
>>> model = PerceiverForMultimodalAutoencoding.from_pretrained('deepmind/multimodal-perceiver')
>>> outputs = model(inputs=inputs)
>>> logits = outputs.logits
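>>> # Note: autoencoding the entire video in one pass is very memory-intensive. As described in the class
>>> # docstring above, one can additionally pass `subsampled_output_points` (a dict mapping each modality to
>>> # the output indices to decode) so that only part of the outputs is decoded per forward pass.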
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.perceiver(
inputs=inputs,
attention_mask=attention_mask,
subsampled_output_points=subsampled_output_points,
head_mask=head_mask,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
logits = outputs.logits if return_dict else outputs[0]
loss = None
if labels is not None:
raise NotImplementedError("Multimodal autoencoding training is not yet supported")
if not return_dict:
output = (logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return PerceiverClassifierOutput(
loss=loss,
logits=logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
cross_attentions=outputs.cross_attentions,
)
# Below: position encodings
def build_position_encoding(
position_encoding_type,
out_channels=None,
project_pos_dim=-1,
trainable_position_encoding_kwargs=None,
fourier_position_encoding_kwargs=None,
):
"""
Builds the position encoding.
Args:
- out_channels: refers to the number of channels of the position encodings.
- project_pos_dim: if specified, will project the position encodings to this dimension.
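Examples (an illustrative sketch; the kwargs below are arbitrary)::
>>> pos_enc, projection = build_position_encoding(
...     "trainable", trainable_position_encoding_kwargs=dict(num_channels=256, index_dims=1)
... )
>>> projection(pos_enc(batch_size=2)).shape
torch.Size([2, 1, 256])
>>> pos_enc, _ = build_position_encoding(
...     "fourier", fourier_position_encoding_kwargs=dict(num_bands=4, max_resolution=(224, 224))
... )
>>> pos_enc(index_dims=(8, 8), batch_size=1, device="cpu").shape
torch.Size([1, 64, 18])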
"""
if position_encoding_type == "trainable":
if not trainable_position_encoding_kwargs:
raise ValueError("Make sure to pass trainable_position_encoding_kwargs")
output_pos_enc = PerceiverTrainablePositionEncoding(**trainable_position_encoding_kwargs)
elif position_encoding_type == "fourier":
# We don't use the index_dims argument, as this is only known during the forward pass
if not fourier_position_encoding_kwargs:
raise ValueError("Make sure to pass fourier_position_encoding_kwargs")
output_pos_enc = PerceiverFourierPositionEncoding(**fourier_position_encoding_kwargs)
else:
raise ValueError(f"Unknown position encoding type: {position_encoding_type}.")
# Optionally, project the position encoding to a target dimension:
positions_projection = nn.Linear(out_channels, project_pos_dim) if project_pos_dim > 0 else nn.Identity()
return output_pos_enc, positions_projection
# Below: Perceiver decoders
class PerceiverAbstractDecoder(nn.Module, metaclass=abc.ABCMeta):
"""Perceiver abstract decoder."""
@abc.abstractmethod
def decoder_query(self, inputs, modality_sizes=None, inputs_without_pos=None, subsampled_points=None):
raise NotImplementedError
@property
@abc.abstractmethod
def num_query_channels(self):
raise NotImplementedError
@abc.abstractmethod
def forward(self, query, z, query_mask=None):
raise NotImplementedError
class PerceiverProjectionDecoder(PerceiverAbstractDecoder):
"""Baseline projection decoder (no cross-attention)."""
def __init__(self, config):
super().__init__()
self.classifier = nn.Linear(config.d_latents, config.num_labels)
def decoder_query(self, inputs, modality_sizes=None, inputs_without_pos=None, subsampled_points=None):
return None
def forward(self, query, z, query_mask=None):
# (batch_size, num_latents, d_latents) -> (batch_size, d_latents)
z = torch.mean(z, dim=1)
# (batch_size, d_latents) -> (batch_size, config.num_labels)
logits = self.classifier(z)
return logits
class PerceiverBasicDecoder(PerceiverAbstractDecoder):
"""
Cross-attention-based decoder.
Here, `output_num_channels` refers to the number of channels of the decoder output, while `num_channels` refers to
the number of channels of the decoder queries.
"""
def __init__(
self,
config,
output_num_channels,
position_encoding_type="trainable",
# The following 2 arguments are ignored if position_encoding_type == 'none':
output_index_dims=None,
num_channels=128,
subsampled_index_dims=None,
qk_channels=None,
v_channels=None,
num_heads=1,
widening_factor=1,
use_query_residual=False,
concat_preprocessed_input=False,
final_project=True,
position_encoding_only=False,
**position_encoding_kwargs,
):
super().__init__()
self.output_num_channels = output_num_channels
# If `none`, the decoder will not construct any position encodings.
# You should construct your own when querying the decoder.
self.output_position_encodings = None
self.position_encoding_type = position_encoding_type
self.position_encoding_kwargs = position_encoding_kwargs
if position_encoding_type != "none":
self.output_position_encodings, self.positions_projection = build_position_encoding(
position_encoding_type=position_encoding_type, **position_encoding_kwargs
)
self.output_index_dims = output_index_dims
self.num_channels = num_channels
if subsampled_index_dims is None:
subsampled_index_dims = output_index_dims
self.subsampled_index_dims = subsampled_index_dims
self.concat_preprocessed_input = concat_preprocessed_input
self.final_project = final_project
self.position_encoding_only = position_encoding_only
# For multimodal autoencoding, we don't need the decoder cross-attention and final layer,
# so position_encoding_only is set to True in that case.
if not self.position_encoding_only:
self.decoding_cross_attention = PerceiverLayer(
config,
is_cross_attention=True,
qk_channels=qk_channels,
v_channels=v_channels,
num_heads=num_heads,
q_dim=num_channels,
kv_dim=config.d_latents,
widening_factor=widening_factor,
use_query_residual=use_query_residual,
)
self.final_layer = nn.Linear(num_channels, output_num_channels) if final_project else nn.Identity()
@property
def num_query_channels(self) -> int:
if self.position_encoding_type == "none": # Queries come from elsewhere
raise ValueError(
"You cannot calculate number of decoder query channels when position_encoding_type is set to none"
)
if self.position_encoding_only:
if "project_pos_dim" in self.position_encoding_kwargs:
return self.position_encoding_kwargs["project_pos_dim"]
return self.output_position_encodings.output_size()
if self.final_project:
return self.output_num_channels
return self.num_channels
def decoder_query(self, inputs, modality_sizes=None, inputs_without_pos=None, subsampled_points=None):
if self.position_encoding_type == "none": # Queries come from elsewhere
raise ValueError("You cannot construct decoder queries when position_encoding_type is set to none")
if subsampled_points is not None:
# subsampled_points are indices into the flattened inputs;
# since the inputs aren't actually flattened, we use unravel_index
# to recover the per-dimension indices for the unflattened array.
# unravel_index returns a tuple (x_idx, y_idx, ...);
# stack them to get the [n, d] tensor of coordinates
indices = list(torch.from_numpy(x) for x in np.unravel_index(subsampled_points, self.output_index_dims))
pos = torch.stack(indices, dim=1)
batch_size = inputs.shape[0]
# Map these coordinates to [-1, 1]
pos = -1 + 2 * pos / torch.tensor(self.output_index_dims)[None, :]
pos = torch.broadcast_to(pos[None], [batch_size, pos.shape[0], pos.shape[1]])
# Construct the position encoding.
if self.position_encoding_type == "trainable":
pos_emb = self.output_position_encodings(batch_size)
elif self.position_encoding_type == "fourier":
pos_emb = self.output_position_encodings(
self.output_index_dims, batch_size=batch_size, device=inputs.device, pos=pos
)
# Optionally project them to a target dimension.
pos_emb = self.positions_projection(pos_emb)
pos_emb = torch.reshape(pos_emb, [pos_emb.shape[0], -1, pos_emb.shape[-1]])
else:
batch_size = inputs.shape[0]
index_dims = inputs.shape[2:]
# Construct the position encoding.
if self.position_encoding_type == "trainable":
pos_emb = self.output_position_encodings(batch_size)
elif self.position_encoding_type == "fourier":
pos_emb = self.output_position_encodings(index_dims, batch_size, device=inputs.device)
# Optionally project them to a target dimension.
pos_emb = self.positions_projection(pos_emb)
if self.concat_preprocessed_input:
if inputs_without_pos is None:
raise ValueError("Value is required for inputs_without_pos if concat_preprocessed_input is True")
pos_emb = torch.cat([inputs_without_pos, pos_emb], dim=-1)
return pos_emb
def forward(self, query, z, query_mask=None, output_attentions=False):
# Cross-attention decoding.
# key, value: B x N x K; query: B x M x K
# Attention maps -> B x N x M
# Output -> B x M x K
cross_attentions = () if output_attentions else None
layer_outputs = self.decoding_cross_attention(
query,
attention_mask=query_mask,
head_mask=None,
inputs=z,
inputs_mask=None,
output_attentions=output_attentions,
)
output = layer_outputs[0]
if output_attentions:
cross_attentions = cross_attentions + (layer_outputs[1],)
logits = self.final_layer(output)
return PerceiverDecoderOutput(logits=logits, cross_attentions=cross_attentions)
class PerceiverClassificationDecoder(PerceiverAbstractDecoder):
"""
Cross-attention based classification decoder. Light-weight wrapper of `PerceiverBasicDecoder` for logit output. Will
turn the output of the Perceiver encoder, which is of shape (batch_size, num_latents, d_latents), into a tensor of
shape (batch_size, num_labels). The decoder queries are trainable position embeddings of shape (batch_size, 1, num_channels).
"""
def __init__(self, config, **decoder_kwargs):
super().__init__()
self.num_labels = config.num_labels
self.decoder = PerceiverBasicDecoder(
config,
output_num_channels=self.num_labels,
output_index_dims=1, # Predict a single logit array.
**decoder_kwargs,
)
@property
def num_query_channels(self) -> int:
return self.decoder.num_query_channels
def decoder_query(self, inputs, modality_sizes=None, inputs_without_pos=None, subsampled_points=None):
return self.decoder.decoder_query(
inputs, modality_sizes, inputs_without_pos, subsampled_points=subsampled_points
)
def forward(self, query, z, query_mask=None, output_attentions=False):
decoder_outputs = self.decoder(query, z, output_attentions=output_attentions)
# B x 1 x num_classes -> B x num_classes
logits = decoder_outputs.logits[:, 0, :]
return PerceiverDecoderOutput(logits=logits, cross_attentions=decoder_outputs.cross_attentions)
class PerceiverOpticalFlowDecoder(PerceiverAbstractDecoder):
"""Cross-attention based optical flow decoder."""
def __init__(self, config, output_image_shape, output_num_channels=2, rescale_factor=100.0, **decoder_kwargs):
super().__init__()
self.output_image_shape = output_image_shape
self.output_num_channels = output_num_channels
self.rescale_factor = rescale_factor
self.decoder = PerceiverBasicDecoder(config, output_num_channels=output_num_channels, **decoder_kwargs)
@property
def num_query_channels(self) -> int:
return self.decoder.num_query_channels
def decoder_query(self, inputs, modality_sizes=None, inputs_without_pos=None, subsampled_points=None):
if subsampled_points is not None:
raise ValueError("FlowDecoder doesn't support subsampling yet.")
return inputs
def forward(self, query, z, query_mask=None, output_attentions=False):
decoder_outputs = self.decoder(query, z, output_attentions=output_attentions)
preds = decoder_outputs.logits
# Output flow and rescale.
preds /= self.rescale_factor
preds = preds.reshape([preds.shape[0]] + list(self.output_image_shape) + [preds.shape[-1]])
return PerceiverDecoderOutput(logits=preds, cross_attentions=decoder_outputs.cross_attentions)
class PerceiverBasicVideoAutoencodingDecoder(PerceiverAbstractDecoder):
"""
Cross-attention based video-autoencoding decoder. Light-weight wrapper of `PerceiverBasicDecoder` with video reshaping
logic.
"""
def __init__(self, config, output_shape, position_encoding_type, **decoder_kwargs):
super().__init__()
if len(output_shape) != 4: # B, T, H, W
raise ValueError(f"Expected rank 4 output_shape, got {output_shape}.")
# Build the decoder components:
self.output_shape = output_shape
self.output_num_channels = decoder_kwargs["output_num_channels"]
self.decoder = PerceiverBasicDecoder(
config,
output_index_dims=self.output_shape[1:4], # (T, H, W)
position_encoding_type=position_encoding_type,
**decoder_kwargs,
)
@property
def num_query_channels(self) -> int:
return self.decoder.num_query_channels
def decoder_query(self, inputs, modality_sizes=None, inputs_without_pos=None, subsampled_points=None):
return self.decoder.decoder_query(
inputs,
modality_sizes=modality_sizes,
inputs_without_pos=inputs_without_pos,
subsampled_points=subsampled_points,
)
def forward(self, query, z, query_mask=None):
decoder_outputs = self.decoder(query, z)
logits = decoder_outputs.logits
logits = torch.reshape(logits, self.output_shape + [logits.shape[-1]])
return PerceiverDecoderOutput(logits=logits, cross_attentions=decoder_outputs.cross_attentions)
def restructure(modality_sizes: ModalitySizeType, inputs: torch.Tensor) -> Mapping[str, torch.Tensor]:
"""
Partitions a [B, N, C] tensor into tensors for each modality.
Args:
modality_sizes
dict specifying the size of the modality
inputs:
input tensor
Returns:
dict mapping name of modality to its associated tensor.
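Examples (illustrative shapes)::
>>> import torch
>>> inputs = torch.randn(2, 10, 8)  # batch of 2, 10 positions, 8 channels
>>> outputs = restructure({"audio": 4, "image": 6}, inputs)
>>> outputs["audio"].shape, outputs["image"].shape
(torch.Size([2, 4, 8]), torch.Size([2, 6, 8]))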
"""
outputs = {}
index = 0
# Apply a predictable ordering to the modalities
for modality in sorted(modality_sizes.keys()):
size = modality_sizes[modality]
inp = inputs[:, index : index + size]
index += size
outputs[modality] = inp
return outputs
class PerceiverMultimodalDecoder(PerceiverAbstractDecoder):
"""
Multimodal decoding by composing uni-modal decoders. The modalities argument of the constructor is a dictionary
mapping modality name to the decoder of that modality. That decoder will be used to construct queries for that
modality. However, there is a shared cross attention across all modalities, using the concatenated per-modality
query vectors.
"""
def __init__(
self,
config,
modalities,
num_outputs,
output_num_channels,
min_padding_size=2,
subsampled_index_dims=None,
**decoder_kwargs
):
super().__init__()
self.modalities = nn.ModuleDict(modalities)
self.subsampled_index_dims = subsampled_index_dims
self.min_padding_size = min_padding_size
self.output_num_channels = output_num_channels
self.num_outputs = num_outputs
self.decoder = PerceiverBasicDecoder(
config,
output_index_dims=(num_outputs,),
output_num_channels=output_num_channels,
position_encoding_type="none",
num_channels=self.num_query_channels,
**decoder_kwargs,
)
self.padding = nn.ParameterDict(
{
modality: nn.Parameter(torch.randn(1, self.num_query_channels - decoder.num_query_channels))
for modality, decoder in modalities.items()
}
)
@property
def num_query_channels(self) -> int:
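# All modality-specific decoder queries are padded up to a common channel size: the largest
# per-modality query size plus min_padding_size. For the multimodal autoencoder configured above
# (an illustrative calculation based on the kwargs used in this file), that is
# max(2 * 192 + 1, 2 * 3 * 32 + 3, 1024) + 2 = 1026 channels.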
max_channel_size = max(decoder.num_query_channels for _, decoder in self.modalities.items())
common_channel_size = max_channel_size + self.min_padding_size
return common_channel_size
def decoder_query(self, inputs, modality_sizes, inputs_without_pos=None, subsampled_points=None):
# Partition the flat inputs among the different modalities
inputs = restructure(modality_sizes, inputs)
# Obtain modality-specific decoders' queries
subsampled_points = subsampled_points or dict()
decoder_queries = dict()
for modality, decoder in self.modalities.items():
# Get input_without_pos for this modality if it exists.
input_without_pos = None
if inputs_without_pos is not None:
input_without_pos = inputs_without_pos.get(modality, None)
query = decoder.decoder_query(
inputs=inputs[modality],
modality_sizes=None,
inputs_without_pos=input_without_pos,
subsampled_points=subsampled_points.get(modality, None),
)
decoder_queries[modality] = query
# Pad all queries with trainable position encodings to make them have the same channels
def embed(modality, x):
x = torch.reshape(x, [x.shape[0], np.prod(x.shape[1:-1]), x.shape[-1]])
pos = self.padding[modality]
pos = torch.broadcast_to(pos, [x.shape[0], x.shape[1], self.num_query_channels - x.shape[2]])
return torch.cat([x, pos], dim=2)
# Apply a predictable ordering to the modalities
return torch.cat(
[embed(modality, decoder_queries[modality]) for modality in sorted(self.modalities.keys())], dim=1
)
def forward(self, query, z, query_mask=None, output_attentions=False):
# B x 1 x num_classes -> B x num_classes
decoder_outputs = self.decoder(query, z, output_attentions=output_attentions)
return decoder_outputs
# Below: IO pre- and post-processor classes for Perceiver.
def space_to_depth(frames: torch.Tensor, temporal_block_size: int = 1, spatial_block_size: int = 1) -> torch.Tensor:
"""
Space to depth transform. Rearranges blocks of spatial data into depth.
This function assumes the channels to be first, but will place the channels last after transformation.
Based on https://discuss.pytorch.org/t/is-there-any-layer-like-tensorflows-space-to-depth-function/3487/15.
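Examples (illustrative)::
>>> import torch
>>> frames = torch.randn(1, 3, 4, 4)  # (batch, channels, height, width)
>>> space_to_depth(frames, spatial_block_size=2).shape
torch.Size([1, 2, 2, 12])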
"""
if len(frames.shape) == 4:
batch_size, num_channels, height, width = frames.shape
# split up dimensions (height by spatial_block_size, width by spatial_block_size)
frames = frames.view(
batch_size,
num_channels,
height // spatial_block_size,
spatial_block_size,
width // spatial_block_size,
spatial_block_size,
)
# move blocks to last dimension: (batch_size, H//bs, W//bs, bs, bs, C)
frames = frames.permute(0, 2, 4, 3, 5, 1).contiguous()
# concatenate blocks along channel dimension: (batch_size, H//bs, W//bs, bs*bs*C)
frames = frames.view(
batch_size,
height // spatial_block_size,
width // spatial_block_size,
(spatial_block_size ** 2) * num_channels,
)
return frames
elif len(frames.shape) == 5:
batch_size, time, num_channels, height, width = frames.shape
# split up dimensions (time by temporal_block_size, height by spatial_block_size, width by spatial_block_size)
frames = frames.view(
batch_size,
time // temporal_block_size,
temporal_block_size,
num_channels,
height // spatial_block_size,
spatial_block_size,
width // spatial_block_size,
spatial_block_size,
)
# move blocks to last dimension: (batch_size, T//ts, H//bs, W//bs, ts, bs, bs, C)
frames = frames.permute(0, 1, 4, 6, 2, 5, 7, 3).contiguous()
# concatenate blocks along channel dimension: (batch_size, T//ts, H//bs, W//bs, ts*bs*bs*C)
frames = frames.view(
batch_size,
time // temporal_block_size,
height // spatial_block_size,
width // spatial_block_size,
temporal_block_size * (spatial_block_size ** 2) * num_channels,
)
return frames
else:
raise ValueError(
"Frames should be of rank 4 (batch, channels, height, width)"
" or rank 5 (batch, time, channels, height, width)"
)
class Conv2dSamePadding(nn.Conv2d):
"""
Conv2d layer with padding="same" support. Source:
https://gist.github.com/sumanmichael/4de9dee93f972d47c80c4ade8e149ea6
"""
def __init__(self, *args, **kwargs):
super(Conv2dSamePadding, self).__init__(*args, **kwargs)
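# For each spatial dimension with kernel size k, zero-pad ceil(k / 2) - 1 pixels on one side and
# k // 2 on the other (e.g. (3, 3) for a 7x7 kernel), so a stride-1 convolution preserves the spatial size.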
self.zero_pad_2d = nn.ZeroPad2d(
reduce(__add__, [(k // 2 + (k - 2 * (k // 2)) - 1, k // 2) for k in self.kernel_size[::-1]])
)
def forward(self, input):
return self._conv_forward(self.zero_pad_2d(input), self.weight, self.bias)
class Conv2DDownsample(nn.Module):
"""Downsamples 4x by applying a 2D convolution and doing max pooling."""
def __init__(
self,
num_layers: int = 1,
in_channels: int = 3,
out_channels: int = 64,
use_batchnorm: bool = True,
):
"""
Constructs a Conv2DDownsample model.
Args:
in_channels (:obj:`int`, `optional`, defaults to 3):
The number of input channels.
out_channels (:obj:`int`, `optional`, defaults to 64):
The number of conv output channels.
use_batchnorm (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to use batchnorm.
"""
super().__init__()
self.conv = Conv2dSamePadding(
in_channels=in_channels, out_channels=out_channels, kernel_size=7, stride=2, bias=False
)
self.batchnorm = nn.BatchNorm2d(num_features=out_channels) if use_batchnorm else nn.Identity()
self.relu = nn.ReLU()
self.max_pool = nn.MaxPool2d(kernel_size=3, stride=2)
def forward(self, inputs: torch.Tensor) -> torch.Tensor:
out = self.conv(inputs)
out = self.batchnorm(out)
out = self.relu(out)
out = self.max_pool(out)
return out
def generate_fourier_features(pos, num_bands, max_resolution=(224, 224), concat_pos=True, sine_only=False):
"""
Generate a Fourier frequency position encoding with linear spacing.
Args:
pos (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length, dim)`):
The Tensor containing the position of n points in d dimensional space.
num_bands (:obj:`int`):
The number of frequency bands (K) to use.
max_resolution (:obj:`Tuple[int]`, `optional`, defaults to (224, 224)):
The maximum resolution (i.e. the number of pixels per dim). A tuple representing resolution for each dimension.
concat_pos (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to concatenate the input position encoding to the Fourier features.
sine_only (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to use a single phase (sin) or two (sin/cos) for each frequency band.
Returns:
:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, n_channels)`: The Fourier position
embeddings. If :obj:`concat_pos` is `True` and :obj:`sine_only` is `False`, output dimensions are ordered as:
[dim_1, dim_2, ..., dim_d, sin(pi*f_1*dim_1), ..., sin(pi*f_K*dim_1), ..., sin(pi*f_1*dim_d), ...,
sin(pi*f_K*dim_d), cos(pi*f_1*dim_1), ..., cos(pi*f_K*dim_1), ..., cos(pi*f_1*dim_d), ..., cos(pi*f_K*dim_d)],
where dim_i is pos[:, i] and f_k is the kth frequency band.
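Examples (an illustrative shape check: 2 * 2 * 4 sin/cos channels plus the 2 raw coordinates = 18 channels)::
>>> import torch
>>> pos = torch.rand(1, 5, 2)  # 5 points in 2-D space
>>> generate_fourier_features(pos, num_bands=4, max_resolution=(224, 224)).shape
torch.Size([1, 5, 18])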
"""
batch_size = pos.shape[0]
min_freq = 1.0
# Nyquist frequency at the target resolution:
freq_bands = torch.stack(
[torch.linspace(start=min_freq, end=res / 2, steps=num_bands) for res in max_resolution], dim=0
)
# Get frequency bands for each spatial dimension.
# Output is size [n, d * num_bands]
per_pos_features = pos[0, :, :][:, :, None] * freq_bands[None, :, :]
per_pos_features = torch.reshape(per_pos_features, [-1, np.prod(per_pos_features.shape[1:])])
if sine_only:
# Output is size [n, d * num_bands]
per_pos_features = torch.sin(np.pi * (per_pos_features))
else:
# Output is size [n, 2 * d * num_bands]
per_pos_features = torch.cat(
[torch.sin(np.pi * per_pos_features), torch.cos(np.pi * per_pos_features)], dim=-1
)
# Concatenate the raw input positions.
if concat_pos:
# Adds d bands to the encoding.
per_pos_features = torch.cat([pos, per_pos_features.expand(batch_size, -1, -1)], dim=-1)
return per_pos_features
def build_linear_positions(index_dims, output_range=(-1.0, 1.0)):
"""
Generate an array of position indices for an N-D input array.
Args:
index_dims (:obj:`List[int]`):
The shape of the index dimensions of the input array.
output_range (:obj:`Tuple[float]`, `optional`, defaults to :obj:`(-1.0, 1.0)`):
The min and max values taken by each input index dimension.
Returns:
:obj:`torch.FloatTensor` of shape :obj:`(index_dims[0], index_dims[1], .., index_dims[-1], N)`.
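Examples (illustrative)::
>>> build_linear_positions([3, 3]).shape
torch.Size([3, 3, 2])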
"""
def _linspace(n_xels_per_dim):
return torch.linspace(start=output_range[0], end=output_range[1], steps=n_xels_per_dim, dtype=torch.float32)
dim_ranges = [_linspace(n_xels_per_dim) for n_xels_per_dim in index_dims]
array_index_grid = torch.meshgrid(*dim_ranges)
return torch.stack(array_index_grid, dim=-1)
class PerceiverAbstractPositionEncoding(nn.Module, metaclass=abc.ABCMeta):
"""Perceiver abstract position encoding."""
@property
@abc.abstractmethod
def num_dimensions(self) -> int:
raise NotImplementedError
@abc.abstractmethod
def output_size(self, *args, **kwargs) -> int:
raise NotImplementedError
@abc.abstractmethod
def forward(self, batch_size, pos):
raise NotImplementedError
class PerceiverTrainablePositionEncoding(PerceiverAbstractPositionEncoding):
"""Trainable position encoding."""
def __init__(self, index_dims, num_channels=128):
super().__init__()
self._num_channels = num_channels
self._index_dims = index_dims
index_dim = np.prod(index_dims)
self.position_embeddings = nn.Parameter(torch.randn(index_dim, num_channels))
@property
def num_dimensions(self) -> int:
if isinstance(self._index_dims, int):
return 1
return len(self._index_dims)
def output_size(self, *args, **kwargs) -> int:
return self._num_channels
def forward(self, batch_size):
position_embeddings = self.position_embeddings
if batch_size is not None:
position_embeddings = position_embeddings.expand(batch_size, -1, -1)
return position_embeddings
def _check_or_build_spatial_positions(pos, index_dims, batch_size):
"""
Checks or builds spatial position features (x, y, ...).
Args:
pos (:obj:`torch.FloatTensor`):
None, or an array of position features. If None, position features are built. Otherwise, their size is checked.
index_dims (:obj:`List[int]`):
An iterable giving the spatial/index size of the data to be featurized.
batch_size (:obj:`int`):
The batch size of the data to be featurized.
Returns:
:obj:`torch.FloatTensor` of shape :obj:`(batch_size, prod(index_dims), N)`: an array of position features, where N is the number of index dimensions.
"""
if pos is None:
pos = build_linear_positions(index_dims)
pos = torch.broadcast_to(pos[None], (batch_size,) + pos.shape)
pos = torch.reshape(pos, [batch_size, np.prod(index_dims), -1])
else:
# Just a warning label: you probably don't want your spatial features to
# have a different spatial layout than your pos coordinate system.
# But feel free to override if you think it'll work!
if pos.shape[-1] != len(index_dims):
raise ValueError("Spatial features have the wrong number of dimensions.")
return pos
class PerceiverFourierPositionEncoding(PerceiverAbstractPositionEncoding):
"""Fourier (Sinusoidal) position encoding."""
def __init__(self, num_bands, max_resolution, concat_pos=True, sine_only=False):
super().__init__()
self.num_bands = num_bands
self.max_resolution = max_resolution
self.concat_pos = concat_pos
self.sine_only = sine_only
@property
def num_dimensions(self) -> int:
return len(self.max_resolution)
def output_size(self):
"""Returns size of positional encodings last dimension."""
num_dims = len(self.max_resolution)
encoding_size = self.num_bands * num_dims
if not self.sine_only:
encoding_size *= 2
if self.concat_pos:
encoding_size += self.num_dimensions
return encoding_size
def forward(self, index_dims, batch_size, device, pos=None):
pos = _check_or_build_spatial_positions(pos, index_dims, batch_size)
fourier_pos_enc = generate_fourier_features(
pos,
num_bands=self.num_bands,
max_resolution=self.max_resolution,
concat_pos=self.concat_pos,
sine_only=self.sine_only,
).to(device)
return fourier_pos_enc
class AbstractPreprocessor(nn.Module):
@property
def num_channels(self) -> int:
"""Returns size of preprocessor output."""
raise NotImplementedError()
class PerceiverTextPreprocessor(AbstractPreprocessor):
"""Text preprocessing for Perceiver Encoder."""
def __init__(self, config):
super().__init__()
self.config = config
self.embeddings = nn.Embedding(num_embeddings=config.vocab_size, embedding_dim=config.d_model)
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.d_model)
@property
def num_channels(self) -> int:
return self.config.d_model
def forward(self, inputs):
embeddings = self.embeddings(inputs)
seq_length = inputs.shape[1]
position_ids = torch.arange(0, seq_length, device=inputs.device)
embeddings = embeddings + self.position_embeddings(position_ids)
return embeddings, None, None
class PerceiverEmbeddingDecoder(nn.Module):
"""Module to decode embeddings (for masked language modeling)."""
def __init__(self, config):
"""Constructs the module."""
super().__init__()
self.config = config
self.vocab_size = config.vocab_size
self.bias = nn.Parameter(torch.zeros(self.vocab_size))
def forward(self, hidden_states, embedding_layer):
batch_size, seq_len, d_model = hidden_states.shape
output = torch.matmul(hidden_states.reshape([-1, d_model]), embedding_layer.weight.T) # Flatten batch dim
output = output + self.bias
return output.reshape([batch_size, seq_len, self.vocab_size])
class PerceiverMultimodalPostprocessor(nn.Module):
"""
Multimodal postprocessing for Perceiver.
Args:
modalities (:obj:`Dict[str, PostprocessorType]`):
Dictionary mapping modality name to postprocessor class for that modality.
input_is_dict (:obj:`bool`, `optional`, defaults to :obj:`False`):
If True, input is assumed to be dictionary structured, and outputs keep the same dictionary shape. If
False, input is a tensor which is sliced up during postprocessing by `modality_sizes`.
"""
def __init__(self, modalities: Mapping[str, PostprocessorType], input_is_dict: bool = False):
super().__init__()
self.modalities = nn.ModuleDict(modalities)
self.input_is_dict = input_is_dict
def forward(
self, inputs: torch.Tensor, pos: Optional[torch.Tensor] = None, modality_sizes=None
) -> Mapping[str, torch.Tensor]:
if not self.input_is_dict:
# Slice up modalities by their sizes.
if modality_sizes is None:
raise ValueError("Modality sizes should be specified if input is not a dictionary.")
inputs = restructure(modality_sizes=modality_sizes, inputs=inputs)
outputs = {
modality: postprocessor(inputs[modality], pos=pos, modality_sizes=None)
for modality, postprocessor in self.modalities.items()
}
return outputs
class PerceiverClassificationPostprocessor(nn.Module):
"""
Classification postprocessing for Perceiver. Can be used to convert the decoder output to classification logits.
Args:
config (:obj:`PerceiverConfig`):
Model configuration.
in_channels (:obj:`int`):
Number of channels in the input.
"""
def __init__(self, config, in_channels):
super().__init__()
self.classifier = nn.Linear(in_channels, config.num_labels)
def forward(self, inputs, pos: Optional[torch.Tensor] = None, modality_sizes=None) -> torch.Tensor:
logits = self.classifier(inputs)
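# Keep only the logits at the first index position -> shape (batch_size, num_labels)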
return logits[:, 0, :]
class PerceiverAudioPostprocessor(nn.Module):
"""
Audio postprocessing for Perceiver. Can be used to convert the decoder output to audio features.
Args:
config (:obj:`PerceiverConfig`):
Model configuration.
in_channels (:obj:`int`):
Number of channels in the input.
postproc_type (:obj:`str`, `optional`, defaults to :obj:`"patches"`):
Postprocessor type to use. Currently, only "patches" is supported.
"""
def __init__(self, config, in_channels, postproc_type: str = "patches"):
super().__init__()
if postproc_type not in ("patches",): # to be supported: 'conv', 'patches', 'pixels'
raise ValueError("Invalid postproc_type!")
# Architecture parameters:
self.classifier = nn.Linear(in_channels, config.samples_per_patch)
def forward(self, inputs: torch.Tensor, pos: Optional[torch.Tensor] = None, modality_sizes=None) -> torch.Tensor:
logits = self.classifier(inputs)
return torch.reshape(logits, [inputs.shape[0], -1])
class PerceiverProjectionPostprocessor(nn.Module):
"""
Projection postprocessing for Perceiver. Can be used to project the channels of the decoder output to a lower
dimension.
Args:
in_channels (:obj:`int`):
Number of channels in the input.
out_channels (:obj:`int`):
Number of channels in the output.
"""
def __init__(self, in_channels, out_channels):
super().__init__()
self.classifier = nn.Linear(in_channels, out_channels)
def forward(self, inputs: torch.Tensor, pos: Optional[torch.Tensor] = None, modality_sizes=None) -> torch.Tensor:
logits = self.classifier(inputs)
return logits
class PerceiverImagePreprocessor(AbstractPreprocessor):
"""
Image preprocessing for Perceiver Encoder.
Note: the `out_channels` argument refers to the output channels of a convolutional layer, if `prep_type` is set to
"conv1x1" or "conv". If one adds absolute position embeddings, one must make sure the `num_channels` of the
position encoding kwargs are set equal to the `out_channels`.
Args:
config (:obj:`PerceiverConfig`):
Model configuration.
prep_type (:obj:`str`, `optional`, defaults to :obj:`"conv"`):
Preprocessing type. Can be "conv1x1", "conv", "patches", "pixels".
spatial_downsample (:obj:`int`, `optional`, defaults to 4):
Spatial downsampling factor.
temporal_downsample (:obj:`int`, `optional`, defaults to 1):
Temporal downsampling factor (only relevant in case a time dimension is present).
position_encoding_type (:obj:`str`, `optional`, defaults to :obj:`"fourier"`):
Position encoding type. Can be "fourier" or "trainable".
in_channels (:obj:`int`, `optional`, defaults to 3):
Number of channels in the input.
out_channels (:obj:`int`, `optional`, defaults to 64):
Number of channels in the output.
conv_after_patching (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to apply a convolutional layer after patching.
conv_after_patching_in_channels (:obj:`int`, `optional`, defaults to 54):
Number of channels in the input of the convolutional layer after patching.
conv2d_use_batchnorm (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to use batch normalization in the convolutional layer.
concat_or_add_pos (:obj:`str`, `optional`, defaults to :obj:`"concat"`):
How to concatenate the position encoding to the input. Can be "concat" or "add".
project_pos_dim (:obj:`int`, `optional`, defaults to -1):
Dimension of the position encoding to project to. If -1, no projection is applied.
**position_encoding_kwargs (:obj:`Dict`, `optional`):
Keyword arguments for the position encoding.
"""
def __init__(
self,
config,
prep_type="conv",
spatial_downsample: int = 4,
temporal_downsample: int = 1,
position_encoding_type: str = "fourier",
in_channels: int = 3,
out_channels: int = 64,
conv_after_patching: bool = False,
conv_after_patching_in_channels: int = 54, # only relevant when conv_after_patching = True
conv2d_use_batchnorm: bool = True,
concat_or_add_pos: str = "concat",
project_pos_dim: int = -1,
**position_encoding_kwargs,
):
super().__init__()
self.config = config
if prep_type not in ("conv", "patches", "pixels", "conv1x1"):
raise ValueError(f"Prep_type {prep_type} is invalid")
if concat_or_add_pos not in ["concat", "add"]:
raise ValueError(f"Invalid value {concat_or_add_pos} for concat_or_add_pos.")
self.in_channels = in_channels
self.prep_type = prep_type
self.spatial_downsample = spatial_downsample
self.temporal_downsample = temporal_downsample
self.position_encoding_type = position_encoding_type
self.concat_or_add_pos = concat_or_add_pos
self.conv_after_patching = conv_after_patching
self.out_channels = out_channels
if self.prep_type == "conv":
# Downsampling with conv is currently restricted
convnet_num_layers = math.log(spatial_downsample, 4)
convnet_num_layers_is_int = convnet_num_layers == np.round(convnet_num_layers)
if not convnet_num_layers_is_int or temporal_downsample != 1:
raise ValueError(
"Only powers of 4 expected for spatial and 1 expected for temporal downsampling with conv."
)
self.convnet = Conv2DDownsample(
in_channels=in_channels,
num_layers=int(convnet_num_layers),
out_channels=out_channels,
use_batchnorm=conv2d_use_batchnorm,
)
elif self.prep_type == "conv1x1":
if temporal_downsample != 1:
raise ValueError("Conv1x1 does not downsample in time.")
self.convnet_1x1 = nn.Conv2d(
in_channels=in_channels,
out_channels=out_channels,
kernel_size=(1, 1),
# spatial_downsample is unconstrained for 1x1 convolutions.
stride=(spatial_downsample, spatial_downsample),
)
# Position embeddings
self.project_pos_dim = project_pos_dim
self.position_embeddings, self.positions_projection = build_position_encoding(
position_encoding_type=position_encoding_type,
out_channels=out_channels,
project_pos_dim=project_pos_dim,
**position_encoding_kwargs,
)
# Optional convolutional layer after patches.
self.conv_after_patches = (
nn.Linear(conv_after_patching_in_channels, self.out_channels) if conv_after_patching else nn.Identity()
)
@property
def num_channels(self) -> int:
# The inputs have 2 index (spatial) dimensions for images and 3 for videos (which have an
# additional temporal dimension). For convenience, we define is_temporal, which indicates
# whether the data has a temporal dimension.
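# For example, with the conv1x1 preprocessor configured above for PerceiverForImageClassificationLearned
# (out_channels=256, trainable position encodings projected to project_pos_dim=256, concat_or_add_pos="concat"),
# this property returns 256 + 256 = 512 channels.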
is_temporal = self.position_embeddings.num_dimensions > 2
# position embedding
if self.project_pos_dim > 0:
pos_dim = self.project_pos_dim
else:
pos_dim = self.position_embeddings.output_size()
if self.concat_or_add_pos == "add":
return pos_dim
# inputs
if self.conv_after_patching or self.prep_type in ("conv1x1", "conv"):
inp_dim = self.out_channels
elif self.prep_type == "pixels":
inp_dim = self.in_channels
if not is_temporal:
inp_dim = math.ceil(inp_dim / self.spatial_downsample)
elif self.prep_type == "patches":
if self.conv_after_patching:
inp_dim = self.out_channels
else:
inp_dim = self.in_channels * self.spatial_downsample ** 2
if is_temporal:
inp_dim *= self.temporal_downsample
return inp_dim + pos_dim
def _build_network_inputs(self, inputs: torch.Tensor, pos: torch.Tensor, network_input_is_1d: bool = True):
"""
Construct the final input, including position encoding.
This method expects the inputs to always have channels as last dimension.
"""
batch_size = inputs.shape[0]
index_dims = inputs.shape[1:-1]
indices = np.prod(index_dims)
# Flatten input features to a 1D index dimension if necessary.
if len(inputs.shape) > 3 and network_input_is_1d:
inputs = torch.reshape(inputs, [batch_size, indices, -1])
# Construct the position encoding.
if self.position_encoding_type == "trainable":
pos_enc = self.position_embeddings(batch_size)
elif self.position_encoding_type == "fourier":
pos_enc = self.position_embeddings(index_dims, batch_size, device=inputs.device)
# Optionally project them to a target dimension.
pos_enc = self.positions_projection(pos_enc)
if not network_input_is_1d:
# Reshape pos to match the input feature shape
# if the network takes non-1D inputs
sh = inputs.shape
pos_enc = torch.reshape(pos_enc, list(sh)[:-1] + [-1])
if self.concat_or_add_pos == "concat":
inputs_with_pos = torch.cat([inputs, pos_enc], dim=-1)
elif self.concat_or_add_pos == "add":
inputs_with_pos = inputs + pos_enc
return inputs_with_pos, inputs
def forward(self, inputs: torch.Tensor, pos: Optional[torch.Tensor] = None, network_input_is_1d: bool = True):
if self.prep_type == "conv":
# Convnet image featurization.
# Downsamples spatially by a factor of 4
inputs = self.convnet(inputs)
elif self.prep_type == "conv1x1":
# map inputs to self.out_channels
inputs = self.convnet_1x1(inputs)
elif self.prep_type == "pixels":
# if requested, downsamples in the crudest way
if inputs.ndim == 4:
inputs = inputs[:, :, :: self.spatial_downsample, :: self.spatial_downsample]
elif inputs.ndim == 5:
inputs = inputs[
:, :: self.temporal_downsample, :, :: self.spatial_downsample, :: self.spatial_downsample
]
else:
raise ValueError("Unsupported data format for pixels.")
elif self.prep_type == "patches":
# Space2depth featurization.
# Video: B x T x C x H x W
inputs = space_to_depth(
inputs, temporal_block_size=self.temporal_downsample, spatial_block_size=self.spatial_downsample
)
if inputs.ndim == 5 and inputs.shape[1] == 1:
# for flow
inputs = inputs.squeeze(dim=1)
# Optionally apply conv layer.
inputs = self.conv_after_patches(inputs)
if self.prep_type != "patches":
# move channels to last dimension, as the _build_network_inputs method expects this
if inputs.ndim == 4:
inputs = torch.moveaxis(inputs, 1, -1)
elif inputs.ndim == 5:
inputs = torch.moveaxis(inputs, 2, -1)
else:
raise ValueError("Unsupported data format for conv1x1.")
inputs, inputs_without_pos = self._build_network_inputs(inputs, pos, network_input_is_1d)
modality_sizes = None # Size for each modality, only needed for multimodal
return inputs, modality_sizes, inputs_without_pos
class PerceiverOneHotPreprocessor(AbstractPreprocessor):
"""
One-hot preprocessor for Perceiver Encoder. Can be used to add a dummy index dimension to the input.
Args:
config (:obj:`PerceiverConfig`):
Model configuration.
"""
def __init__(self, config):
super().__init__()
self.config: PerceiverConfig = config
@property
def num_channels(self) -> int:
return self.config.num_labels
def forward(self, inputs: torch.Tensor, pos: Optional[torch.Tensor] = None, network_input_is_1d: bool = True):
# Add a dummy index dimension.
inputs = inputs[:, None, :]
# No position encodings, so the 1st (input) and 3rd (inputs_without_pos)
# outputs are identical.
return inputs, None, inputs
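# Example (illustrative): one-hot class labels of shape (batch_size, num_labels) become
# (batch_size, 1, num_labels), i.e. a single "token" per example.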
class PerceiverAudioPreprocessor(AbstractPreprocessor):
"""
Audio preprocessing for Perceiver Encoder.
Args:
config (:obj:`PerceiverConfig`):
Model configuration.
prep_type (:obj:`str`, `optional`, defaults to :obj:`"patches"`):
Preprocessor type to use. Only "patches" is supported.
samples_per_patch (:obj:`int`, `optional`, defaults to 96):
Number of samples per patch.
position_encoding_type (:obj:`str`, `optional`, defaults to :obj:`"fourier"`):
Type of position encoding to use. Can be "trainable" or "fourier".
concat_or_add_pos (:obj:`str`, `optional`, defaults to :obj:`"concat"`):
Whether to concatenate or add the position encoding to the input. Can be "concat" or "add".
out_channels (:obj:`int`, `optional`, defaults to 64):
Number of channels in the output.
project_pos_dim (:obj:`int`, `optional`, defaults to -1):
Dimension of the position encoding to project to. If -1, no projection is applied.
**position_encoding_kwargs (:obj:`Dict`, `optional`):
Keyword arguments for the position encoding.
"""
def __init__(
self,
config,
prep_type: str = "patches",
samples_per_patch: int = 96,
position_encoding_type: str = "fourier",
concat_or_add_pos: str = "concat",
out_channels=64,
project_pos_dim=-1,
**position_encoding_kwargs,
):
super().__init__()
self.config = config
if prep_type not in ("patches",):
raise ValueError(f"Prep_type {prep_type} is invalid, can only be 'patches'.")
if concat_or_add_pos not in ["concat", "add"]:
raise ValueError(f"Concat_or_pos {concat_or_add_pos} is invalid, can only be 'concat' or 'add'.")
self.samples_per_patch = samples_per_patch
self.position_encoding_type = position_encoding_type
self.concat_or_add_pos = concat_or_add_pos
self.project_pos_dim = project_pos_dim
# Position embeddings
self.position_embeddings, self.positions_projection = build_position_encoding(
position_encoding_type=position_encoding_type,
out_channels=out_channels,
project_pos_dim=project_pos_dim,
**position_encoding_kwargs,
)
@property
def num_channels(self) -> int:
# position embedding
if self.project_pos_dim > 0:
pos_dim = self.project_pos_dim
else:
pos_dim = self.position_embeddings.output_size()
if self.concat_or_add_pos == "add":
return pos_dim
return self.samples_per_patch + pos_dim
def _build_network_inputs(self, inputs, pos):
"""Construct the final input, including position encoding."""
batch_size = inputs.shape[0]
index_dims = inputs.shape[1:-1]
# Construct the position encoding.
if self.position_encoding_type == "trainable":
pos_enc = self.position_embeddings(batch_size)
elif self.position_encoding_type == "fourier":
pos_enc = self.position_embeddings(index_dims, batch_size, device=inputs.device)
# Optionally project them to a target dimension.
pos_enc = self.positions_projection(pos_enc)
if self.concat_or_add_pos == "concat":
inputs_with_pos = torch.cat([inputs, pos_enc], dim=-1)
elif self.concat_or_add_pos == "add":
inputs_with_pos = inputs + pos_enc
return inputs_with_pos, inputs
def forward(self, inputs, pos, network_input_is_1d: bool = True):
inputs = torch.reshape(inputs, [inputs.shape[0], -1, self.samples_per_patch])
inputs, inputs_without_pos = self._build_network_inputs(inputs, pos)
modality_sizes = None # Size for each modality, only needed for multimodal
return inputs, modality_sizes, inputs_without_pos
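# Example (illustrative, assumed sample count): raw audio of shape (batch_size, 1920, 1) with
# samples_per_patch=96 is reshaped to (batch_size, 20, 96) before the Fourier (or trainable)
# position features are concatenated.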
class PerceiverMultimodalPreprocessor(AbstractPreprocessor):
"""
Multimodal preprocessing for Perceiver Encoder.
Inputs for each modality are preprocessed, then padded with trainable position embeddings to have the same number
of channels.
Args:
modalities (:obj:`Dict[str, PreprocessorType]`):
Dict mapping modality name to preprocessor.
mask_probs (:obj:`Dict[str, float]`):
Dict mapping modality name to masking probability of that modality.
min_padding_size (:obj:`int`, `optional`, defaults to 2):
The minimum padding size for all modalities. The final output will have num_channels equal to the maximum
channels across all modalities plus min_padding_size.
"""
def __init__(
self,
modalities: Mapping[str, PreprocessorType],
mask_probs: Optional[Mapping[str, float]] = None,
min_padding_size: int = 2,
):
super().__init__()
self.modalities = modalities
self.min_padding_size = min_padding_size
self.mask_probs = mask_probs if mask_probs is not None else dict()
self.padding = nn.ParameterDict(
{
modality: nn.Parameter(torch.randn(1, self.num_channels - preprocessor.num_channels))
for modality, preprocessor in modalities.items()
}
)
self.mask = nn.ParameterDict(
{modality: nn.Parameter(torch.randn(1, self.num_channels)) for modality, _ in self.mask_probs.items()}
)
@property
def num_channels(self) -> int:
max_channel_size = max(processor.num_channels for _, processor in self.modalities.items())
common_channel_size = max_channel_size + self.min_padding_size
return common_channel_size
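# Example (illustrative, hypothetical channel counts): if the image preprocessor outputs 243 channels
# and the audio preprocessor 193, then common_channel_size = 243 + 2 = 245 and both modalities are
# padded with trainable parameters up to 245 channels before being concatenated along the time dimension.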
def forward(
self, inputs: Mapping[str, torch.Tensor], pos: Optional[torch.Tensor] = None, network_input_is_1d: bool = True
) -> PreprocessorOutputType:
padded = {}
modality_sizes = {}
inputs_without_pos = {}
for modality, preprocessor in self.modalities.items():
# preprocess each modality using the respective preprocessor.
output, _, inputs_without_pos[modality] = preprocessor(
inputs[modality], pos=pos, network_input_is_1d=network_input_is_1d
)
# pad to the same common_channel_size.
batch_size, num_samples, num_channels = output.shape
pos_enc = self.padding[modality].expand(batch_size, -1, -1)
padding = torch.broadcast_to(
pos_enc,
[batch_size, num_samples, self.num_channels - num_channels],
)
output_padded = torch.cat([output, padding], dim=2)
# mask if required
if modality in self.mask_probs:
mask_token = self.mask[modality].expand(batch_size, -1, -1)
mask_prob = self.mask_probs[modality]
mask = torch.bernoulli(torch.full([batch_size, num_samples], mask_prob))
mask = torch.unsqueeze(mask, dim=2).to(mask_token.device)
output_padded = (1 - mask) * output_padded + mask * mask_token
padded[modality] = output_padded
modality_sizes[modality] = output_padded.shape[1]
# Apply a predictable ordering to the modalities
padded_ls = [padded[k] for k in sorted(padded.keys())]
# Finally, concatenate along the time dimension
final_inputs = torch.cat(padded_ls, dim=1)
return final_inputs, modality_sizes, inputs_without_pos
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Tokenization class for Perceiver."""
from typing import Dict, List, Optional, Tuple
from ...tokenization_utils import AddedToken, PreTrainedTokenizer
from ...utils import logging
logger = logging.get_logger(__name__)
class PerceiverTokenizer(PreTrainedTokenizer):
"""
Construct a Perceiver tokenizer. The Perceiver simply uses raw bytes utf-8 encoding.
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.
Args:
pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`):
The token used for padding, for example when batching sequences of different lengths.
bos_token (:obj:`str`, `optional`, defaults to :obj:`"[BOS]"`):
The BOS token (reserved in the vocab, but not actually used).
eos_token (:obj:`str`, `optional`, defaults to :obj:`"[EOS]"`):
The end of sequence token (reserved in the vocab, but not actually used).
.. note::
When building a sequence using special tokens, this is not the token that is used for the end of
sequence. The token used is the :obj:`sep_token`.
mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
The MASK token, useful for masked language modeling.
cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
The CLS token (reserved in the vocab, but not actually used).
sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
The separator token, which is used when building a sequence from two sequences.
"""
model_input_names = ["input_ids", "attention_mask"]
def __init__(
self,
pad_token="[PAD]",
bos_token="[BOS]",
eos_token="[EOS]",
mask_token="[MASK]",
cls_token="[CLS]",
sep_token="[SEP]",
model_max_length=2048,
**kwargs
) -> None:
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
mask_token = AddedToken(mask_token, lstrip=False, rstrip=False) if isinstance(mask_token, str) else mask_token
cls_token = AddedToken(cls_token, lstrip=False, rstrip=False) if isinstance(cls_token, str) else cls_token
sep_token = AddedToken(sep_token, lstrip=False, rstrip=False) if isinstance(sep_token, str) else sep_token
super().__init__(
pad_token=pad_token,
bos_token=bos_token,
eos_token=eos_token,
mask_token=mask_token,
cls_token=cls_token,
sep_token=sep_token,
model_max_length=model_max_length,
**kwargs,
)
self._utf_vocab_size = 2 ** 8 # utf is 8 bits
# define special tokens dict
self.special_tokens_encoder: Dict[str, int] = {
self.pad_token: 0,
self.bos_token: 1,
self.eos_token: 2,
self.mask_token: 3,
self.cls_token: 4,
self.sep_token: 5,
}
self._num_special_tokens = len(self.special_tokens_encoder)
self.special_tokens_decoder: Dict[int, str] = {v: k for k, v in self.special_tokens_encoder.items()}
@property
def vocab_size(self):
return self._utf_vocab_size + self._num_special_tokens
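# i.e. 256 possible byte values + 6 special tokens = 262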
def get_special_tokens_mask(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``prepare_for_model`` method.
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the token list is already formatted with special tokens for the model.
Returns:
:obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
return super().get_special_tokens_mask(
token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
)
# normal case: some special tokens
if token_ids_1 is None:
return [1] + [0] * len(token_ids_0) + [1]
return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
def build_inputs_with_special_tokens(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequences for sequence classification tasks. A sequence has the
following format:
- single sequence: ``[CLS] X [SEP]``
- pair of sequences: ``[CLS] A [SEP] B [SEP]``
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
"""
if token_ids_1 is None:
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
else:
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id] + token_ids_1 + [self.sep_token_id]
def _tokenize(self, text: str) -> List[str]:
"""Take as input a string and return a list of strings (tokens) for words/sub-words"""
tokens = [chr(i) for i in text.encode("utf-8")]
return tokens
def _convert_token_to_id(self, token):
"""Converts a token (str) in an id using the vocab."""
if token in self.special_tokens_encoder:
token_id = self.special_tokens_encoder[token]
elif token in self.added_tokens_encoder:
token_id = self.added_tokens_encoder[token]
elif len(token) != 1:
token_id = self.unk_token_id
else:
token_id = ord(token) + self._num_special_tokens
return token_id
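# Example (illustrative): "h" is the single UTF-8 byte 104, so its id is 104 + 6 = 110; with the
# default special tokens, tokenizing "hi" (special tokens added) yields [4, 110, 111, 5], i.e. [CLS] h i [SEP].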
def _convert_id_to_token(self, index):
"""Converts an index (integer) in a token (str) using the vocab."""
if index in self.special_tokens_decoder:
token = self.special_tokens_decoder[index]
elif index in self.added_tokens_decoder:
token = self.added_tokens_decoder[index]
else:
token = chr(index - self._num_special_tokens)
return token
def convert_tokens_to_string(self, tokens):
"""Converts a sequence of tokens (string) in a single string."""
bstring = b""
for token in tokens:
if token in self.special_tokens_decoder:
tok_string = self.special_tokens_decoder[token].encode("utf-8")
elif token in self.added_tokens_decoder:
tok_string = self.added_tokens_decoder[token].encode("utf-8")
elif token in self.special_tokens_encoder:
tok_string = token.encode("utf-8")
elif token in self.added_tokens_encoder:
tok_string = token.encode("utf-8")
else:
tok_string = bytes([ord(token)])
bstring += tok_string
string = bstring.decode("utf-8", errors="replace")
return string
# PerceiverTokenizer has no vocab file
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
return ()
...@@ -3753,6 +3753,87 @@ class PegasusPreTrainedModel:
requires_backends(self, ["torch"])
PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST = None
class PerceiverForImageClassificationConvProcessing:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverForImageClassificationFourier:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverForImageClassificationLearned:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverForMaskedLM:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
def forward(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverForMultimodalAutoencoding:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverForOpticalFlow:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverForSequenceClassification:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
def forward(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverLayer:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverModel:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
def forward(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverPreTrainedModel:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
def forward(self, *args, **kwargs):
requires_backends(self, ["torch"])
PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST = None
...
...@@ -64,6 +64,11 @@ class LayoutXLMProcessor:
requires_backends(cls, ["vision"])
class PerceiverFeatureExtractor:
def __init__(self, *args, **kwargs):
requires_backends(self, ["vision"])
class SegformerFeatureExtractor:
def __init__(self, *args, **kwargs):
requires_backends(self, ["vision"])
...
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Testing suite for the PyTorch Perceiver model. """
import copy
import inspect
import math
import tempfile
import unittest
import warnings
from typing import Dict, List, Tuple
import numpy as np
from datasets import load_dataset
from transformers import PerceiverConfig
from transformers.file_utils import is_torch_available, is_vision_available
from transformers.models.auto import get_values
from transformers.testing_utils import require_torch, require_vision, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask
if is_torch_available():
import torch
from torch import nn
from transformers import (
MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING,
MODEL_FOR_MASKED_LM_MAPPING,
MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,
MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING,
MODEL_MAPPING,
PerceiverForImageClassificationConvProcessing,
PerceiverForImageClassificationFourier,
PerceiverForImageClassificationLearned,
PerceiverForMaskedLM,
PerceiverForMultimodalAutoencoding,
PerceiverForOpticalFlow,
PerceiverForSequenceClassification,
PerceiverModel,
PerceiverTokenizer,
)
from transformers.models.perceiver.modeling_perceiver import PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST
if is_vision_available():
from PIL import Image
from transformers import PerceiverFeatureExtractor
class PerceiverModelTester:
def __init__(
self,
parent,
batch_size=13,
seq_length=7,
num_channels=3,
image_size=32,
train_size=[20, 20],
num_frames=5,
audio_samples_per_frame=200,
samples_per_patch=20,
nchunks=20,
num_latents=10,
d_latents=20,
num_blocks=1,
num_self_attends_per_block=2,
num_self_attention_heads=1,
num_cross_attention_heads=1,
is_training=True,
use_input_mask=True,
use_labels=True,
vocab_size=99,
hidden_act="gelu",
attention_probs_dropout_prob=0.1,
initializer_range=0.02,
max_position_embeddings=7,
num_labels=3,
scope=None,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.num_channels = num_channels
self.image_size = image_size
self.train_size = train_size
self.num_frames = num_frames
self.audio_samples_per_frame = audio_samples_per_frame
self.samples_per_patch = samples_per_patch
self.nchunks = nchunks
self.num_latents = num_latents
self.d_latents = d_latents
self.num_blocks = num_blocks
self.num_self_attends_per_block = num_self_attends_per_block
self.num_self_attention_heads = num_self_attention_heads
self.num_cross_attention_heads = num_cross_attention_heads
self.is_training = is_training
self.use_input_mask = use_input_mask
self.use_labels = use_labels
self.vocab_size = vocab_size
self.hidden_act = hidden_act
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.initializer_range = initializer_range
self.num_labels = num_labels
self.scope = scope
# set subsampling for multimodal model (take first chunk)
image_chunk_size = np.prod((self.num_frames, self.image_size, self.image_size)) // self.nchunks
audio_chunk_size = self.num_frames * self.audio_samples_per_frame // self.samples_per_patch // self.nchunks
self.subsampling = {
"image": torch.arange(0, image_chunk_size),
"audio": torch.arange(0, audio_chunk_size),
"label": None,
}
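# With the default tester values above: image chunk = 5 * 32 * 32 // 20 = 256 indices and
# audio chunk = 5 * 200 // 20 // 20 = 2 indices (only the first chunk is decoded in the tests).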
def prepare_config_and_inputs(self, model_class=None):
config = self.get_config()
input_mask = None
sequence_labels = None
token_labels = None
if self.use_labels:
sequence_labels = ids_tensor([self.batch_size], self.num_labels)
token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
if model_class is None or model_class.__name__ == "PerceiverModel":
inputs = floats_tensor([self.batch_size, self.seq_length, config.d_model], self.vocab_size)
return config, inputs, input_mask, sequence_labels, token_labels
elif model_class.__name__ in ["PerceiverForMaskedLM", "PerceiverForSequenceClassification"]:
inputs = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
# input mask is only relevant for text inputs
if self.use_input_mask:
input_mask = random_attention_mask([self.batch_size, self.seq_length])
elif model_class.__name__ == "PerceiverForImageClassificationLearned":
config.d_model = 512
inputs = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
elif model_class.__name__ == "PerceiverForImageClassificationFourier":
config.d_model = 261
inputs = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
elif model_class.__name__ == "PerceiverForImageClassificationConvProcessing":
config.d_model = 322
inputs = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
elif model_class.__name__ == "PerceiverForOpticalFlow":
config.d_model = 322
inputs = floats_tensor([self.batch_size, 2, 27, self.train_size[0], self.train_size[1]])
elif model_class.__name__ == "PerceiverForMultimodalAutoencoding":
config.d_model = 409
images = torch.randn(
(self.batch_size, self.num_frames, self.num_channels, self.image_size, self.image_size),
device=torch_device,
)
audio = torch.randn(
(self.batch_size, self.num_frames * self.audio_samples_per_frame, 1), device=torch_device
)
inputs = dict(
image=images, audio=audio, label=torch.zeros((self.batch_size, self.num_labels), device=torch_device)
)
else:
raise ValueError(f"Model class {model_class} not supported")
return config, inputs, input_mask, sequence_labels, token_labels
def get_config(self):
return PerceiverConfig(
num_latents=self.num_latents,
d_latents=self.d_latents,
num_blocks=self.num_blocks,
num_self_attends_per_block=self.num_self_attends_per_block,
num_self_attention_heads=self.num_self_attention_heads,
num_cross_attention_heads=self.num_cross_attention_heads,
vocab_size=self.vocab_size,
hidden_act=self.hidden_act,
attention_probs_dropout_prob=self.attention_probs_dropout_prob,
initializer_range=self.initializer_range,
max_position_embeddings=self.max_position_embeddings,
image_size=self.image_size,
train_size=self.train_size,
num_frames=self.num_frames,
audio_samples_per_frame=self.audio_samples_per_frame,
samples_per_patch=self.samples_per_patch,
num_labels=self.num_labels,
)
def create_and_check_for_masked_lm(self, config, inputs, input_mask, sequence_labels, token_labels):
model = PerceiverForMaskedLM(config=config)
model.to(torch_device)
model.eval()
result = model(inputs, attention_mask=input_mask, labels=token_labels)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))
def create_and_check_for_sequence_classification(self, config, inputs, input_mask, sequence_labels, token_labels):
# set num_labels
config.num_labels = self.num_labels
model = PerceiverForSequenceClassification(config=config)
model.to(torch_device)
model.eval()
result = model(inputs, attention_mask=input_mask, labels=sequence_labels)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_labels))
def create_and_check_for_image_classification_learned(
self, config, inputs, input_mask, sequence_labels, token_labels
):
# set d_model and num_labels
config.d_model = 512
config.num_labels = self.num_labels
model = PerceiverForImageClassificationLearned(config=config)
model.to(torch_device)
model.eval()
result = model(inputs, attention_mask=input_mask, labels=sequence_labels)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_labels))
def create_and_check_for_image_classification_fourier(
self, config, inputs, input_mask, sequence_labels, token_labels
):
# set d_model and num_labels
config.d_model = 261
config.num_labels = self.num_labels
model = PerceiverForImageClassificationFourier(config=config)
model.to(torch_device)
model.eval()
result = model(inputs, attention_mask=input_mask, labels=sequence_labels)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_labels))
def create_and_check_for_image_classification_conv(
self, config, inputs, input_mask, sequence_labels, token_labels
):
# set d_model and num_labels
config.d_model = 322
config.num_labels = self.num_labels
model = PerceiverForImageClassificationConvProcessing(config=config)
model.to(torch_device)
model.eval()
result = model(inputs, attention_mask=input_mask, labels=sequence_labels)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_labels))
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
config, inputs, input_mask, sequence_labels, token_labels = config_and_inputs
inputs_dict = {"inputs": inputs, "attention_mask": input_mask}
return config, inputs_dict
def prepare_config_and_inputs_for_model_class(self, model_class):
config_and_inputs = self.prepare_config_and_inputs(model_class)
config, inputs, input_mask, sequence_labels, token_labels = config_and_inputs
inputs_dict = {"inputs": inputs, "attention_mask": input_mask}
return config, inputs_dict
@require_torch
class PerceiverModelTest(ModelTesterMixin, unittest.TestCase):
all_model_classes = (
(
PerceiverModel,
PerceiverForMaskedLM,
PerceiverForImageClassificationLearned,
PerceiverForImageClassificationConvProcessing,
PerceiverForImageClassificationFourier,
PerceiverForOpticalFlow,
PerceiverForMultimodalAutoencoding,
PerceiverForSequenceClassification,
)
if is_torch_available()
else ()
)
test_pruning = False
test_head_masking = False
test_torchscript = False
maxDiff = None
def setUp(self):
self.model_tester = PerceiverModelTester(self)
self.config_tester = ConfigTester(self, config_class=PerceiverConfig, hidden_size=37)
def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
inputs_dict = copy.deepcopy(inputs_dict)
if model_class.__name__ == "PerceiverForMultimodalAutoencoding":
inputs_dict["subsampled_output_points"] = self.model_tester.subsampling
if return_labels:
if model_class in [
*get_values(MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING),
*get_values(MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING),
]:
inputs_dict["labels"] = torch.zeros(
self.model_tester.batch_size, dtype=torch.long, device=torch_device
)
elif model_class in [
*get_values(MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING),
*get_values(MODEL_FOR_MASKED_LM_MAPPING),
]:
inputs_dict["labels"] = torch.zeros(
(self.model_tester.batch_size, self.model_tester.seq_length), dtype=torch.long, device=torch_device
)
return inputs_dict
def test_config(self):
# we don't test common_properties and arguments_init as these don't apply for Perceiver
self.config_tester.create_and_test_config_to_json_string()
self.config_tester.create_and_test_config_to_json_file()
self.config_tester.create_and_test_config_from_and_save_pretrained()
self.config_tester.create_and_test_config_with_num_labels()
self.config_tester.check_config_can_be_init_without_params()
def test_for_masked_lm(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs(model_class=PerceiverForMaskedLM)
self.model_tester.create_and_check_for_masked_lm(*config_and_inputs)
def test_for_sequence_classification(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs(model_class=PerceiverForSequenceClassification)
self.model_tester.create_and_check_for_sequence_classification(*config_and_inputs)
def test_for_image_classification_learned(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs(
model_class=PerceiverForImageClassificationLearned
)
self.model_tester.create_and_check_for_image_classification_learned(*config_and_inputs)
def test_for_image_classification_fourier(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs(
model_class=PerceiverForImageClassificationFourier
)
self.model_tester.create_and_check_for_image_classification_fourier(*config_and_inputs)
def test_for_image_classification_conv(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs(
model_class=PerceiverForImageClassificationConvProcessing
)
self.model_tester.create_and_check_for_image_classification_conv(*config_and_inputs)
def test_model_common_attributes(self):
for model_class in self.all_model_classes:
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
model = model_class(config)
# we overwrite this, as the embeddings of Perceiver are an instance of nn.Parameter
# and Perceiver doesn't support get_output_embeddings
self.assertIsInstance(model.get_input_embeddings(), (nn.Parameter))
def test_training(self):
if not self.model_tester.is_training:
return
for model_class in self.all_model_classes:
if model_class in [
*get_values(MODEL_MAPPING),
PerceiverForOpticalFlow,
PerceiverForMultimodalAutoencoding,
]:
continue
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
config.return_dict = True
model = model_class(config)
model.to(torch_device)
model.train()
inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
loss = model(**inputs).loss
loss.backward()
def test_forward_signature(self):
for model_class in self.all_model_classes:
config, _ = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
model = model_class(config)
signature = inspect.signature(model.forward)
# signature.parameters is an OrderedDict => so arg_names order is deterministic
arg_names = [*signature.parameters.keys()]
expected_arg_names = ["inputs"]
self.assertListEqual(arg_names[:1], expected_arg_names)
def test_determinism(self):
for model_class in self.all_model_classes:
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
inputs_dict = self._prepare_for_class(inputs_dict, model_class)
first = model(**inputs_dict)[0]
second = model(**inputs_dict)[0]
if model_class.__name__ == "PerceiverForMultimodalAutoencoding":
# model outputs a dictionary with logits per modality, let's verify each modality
for modality in first.keys():
out_1 = first[modality].cpu().numpy()
out_2 = second[modality].cpu().numpy()
out_1 = out_1[~np.isnan(out_1)]
out_2 = out_2[~np.isnan(out_2)]
max_diff = np.amax(np.abs(out_1 - out_2))
self.assertLessEqual(max_diff, 1e-5)
else:
out_1 = first.cpu().numpy()
out_2 = second.cpu().numpy()
out_1 = out_1[~np.isnan(out_1)]
out_2 = out_2[~np.isnan(out_2)]
max_diff = np.amax(np.abs(out_1 - out_2))
self.assertLessEqual(max_diff, 1e-5)
def test_attention_outputs(self):
seq_len = getattr(self.model_tester, "num_latents", None)
for model_class in self.all_model_classes:
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
config.return_dict = True
inputs_dict["output_attentions"] = True
inputs_dict["output_hidden_states"] = False
config.return_dict = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
self_attentions = outputs.attentions
cross_attentions = outputs.cross_attentions
# check expected number of attentions depending on model class
expected_num_self_attentions = self.model_tester.num_blocks * self.model_tester.num_self_attends_per_block
if model.__class__.__name__ == "PerceiverModel":
# we expect to have 1 cross-attention, namely the one in the PerceiverEncoder
expected_num_cross_attentions = 1
else:
# we expect to have 2 cross-attentions, namely one in the PerceiverEncoder, and one in PerceiverBasicDecoder
expected_num_cross_attentions = 2
self.assertEqual(len(self_attentions), expected_num_self_attentions)
self.assertEqual(len(cross_attentions), expected_num_cross_attentions)
# check that output_attentions also work using config
del inputs_dict["output_attentions"]
config.output_attentions = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
self_attentions = outputs.attentions
cross_attentions = outputs.cross_attentions
self.assertEqual(len(self_attentions), expected_num_self_attentions)
self.assertEqual(len(cross_attentions), expected_num_cross_attentions)
self.assertListEqual(
list(self_attentions[0].shape[-3:]),
[self.model_tester.num_self_attention_heads, seq_len, seq_len],
)
out_len = len(outputs)
# Check attention is always last and order is fine
inputs_dict["output_attentions"] = True
inputs_dict["output_hidden_states"] = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
self.assertEqual(out_len + 1, len(outputs))
self_attentions = outputs.attentions
self.assertEqual(len(self_attentions), expected_num_self_attentions)
self.assertListEqual(
list(self_attentions[0].shape[-3:]),
[self.model_tester.num_self_attention_heads, seq_len, seq_len],
)
def test_hidden_states_output(self):
def check_hidden_states_output(inputs_dict, config, model_class):
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
hidden_states = outputs.hidden_states
expected_num_layers = self.model_tester.num_blocks * self.model_tester.num_self_attends_per_block + 1
self.assertEqual(len(hidden_states), expected_num_layers)
seq_length = self.model_tester.num_latents
self.assertListEqual(
list(hidden_states[0].shape[-2:]),
[seq_length, self.model_tester.d_latents],
)
for model_class in self.all_model_classes:
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
inputs_dict["output_hidden_states"] = True
check_hidden_states_output(inputs_dict, config, model_class)
# check that output_hidden_states also work using config
del inputs_dict["output_hidden_states"]
config.output_hidden_states = True
check_hidden_states_output(inputs_dict, config, model_class)
def test_model_outputs_equivalence(self):
def set_nan_tensor_to_zero(t):
t[t != t] = 0
return t
def check_equivalence(model, tuple_inputs, dict_inputs, additional_kwargs={}):
with torch.no_grad():
tuple_output = model(**tuple_inputs, return_dict=False, **additional_kwargs)
dict_output = model(**dict_inputs, return_dict=True, **additional_kwargs).to_tuple()
def recursive_check(tuple_object, dict_object):
if isinstance(tuple_object, (List, Tuple)):
for tuple_iterable_value, dict_iterable_value in zip(tuple_object, dict_object):
recursive_check(tuple_iterable_value, dict_iterable_value)
elif isinstance(tuple_object, Dict):
for tuple_iterable_value, dict_iterable_value in zip(
tuple_object.values(), dict_object.values()
):
recursive_check(tuple_iterable_value, dict_iterable_value)
elif tuple_object is None:
return
else:
self.assertTrue(
torch.allclose(
set_nan_tensor_to_zero(tuple_object), set_nan_tensor_to_zero(dict_object), atol=1e-5
),
msg=f"Tuple and dict output are not equal. Difference: {torch.max(torch.abs(tuple_object - dict_object))}. "
f"Tuple has `nan`: {torch.isnan(tuple_object).any()} and `inf`: {torch.isinf(tuple_object)}. "
f"Dict has `nan`: {torch.isnan(dict_object).any()} and `inf`: {torch.isinf(dict_object)}.",
)
recursive_check(tuple_output, dict_output)
for model_class in self.all_model_classes:
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
model = model_class(config)
model.to(torch_device)
model.eval()
tuple_inputs = self._prepare_for_class(inputs_dict, model_class)
dict_inputs = self._prepare_for_class(inputs_dict, model_class)
check_equivalence(model, tuple_inputs, dict_inputs)
if model_class.__name__ not in ["PerceiverForOpticalFlow", "PerceiverForMultimodalAutoencoding"]:
# optical flow + multimodal models don't support training for now
tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
check_equivalence(model, tuple_inputs, dict_inputs)
tuple_inputs = self._prepare_for_class(inputs_dict, model_class)
dict_inputs = self._prepare_for_class(inputs_dict, model_class)
check_equivalence(model, tuple_inputs, dict_inputs, {"output_hidden_states": True})
tuple_inputs = self._prepare_for_class(inputs_dict, model_class)
dict_inputs = self._prepare_for_class(inputs_dict, model_class)
check_equivalence(model, tuple_inputs, dict_inputs, {"output_attentions": True})
if model_class.__name__ not in ["PerceiverForOpticalFlow", "PerceiverForMultimodalAutoencoding"]:
# optical flow + multimodal models don't support training for now
tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
check_equivalence(model, tuple_inputs, dict_inputs, {"output_hidden_states": True})
if model_class.__name__ not in ["PerceiverForOpticalFlow", "PerceiverForMultimodalAutoencoding"]:
# optical flow + multimodal models don't support training for now
tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
check_equivalence(model, tuple_inputs, dict_inputs, {"output_attentions": True})
if model_class.__name__ not in ["PerceiverForOpticalFlow", "PerceiverForMultimodalAutoencoding"]:
# optical flow + multimodal models don't support training for now
tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
check_equivalence(
model, tuple_inputs, dict_inputs, {"output_hidden_states": True, "output_attentions": True}
)
def test_retain_grad_hidden_states_attentions(self):
# no need to test all models as different heads yield the same functionality
model_class = PerceiverForMaskedLM
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
config.output_hidden_states = True
config.output_attentions = True
model = model_class(config)
model.to(torch_device)
inputs = self._prepare_for_class(inputs_dict, model_class)
outputs = model(**inputs)
output = outputs[0]
# Encoder-only model
hidden_states = outputs.hidden_states[0]
attentions = outputs.attentions[0]
hidden_states.retain_grad()
attentions.retain_grad()
output.flatten()[0].backward(retain_graph=True)
self.assertIsNotNone(hidden_states.grad)
self.assertIsNotNone(attentions.grad)
def test_feed_forward_chunking(self):
for model_class in self.all_model_classes:
original_config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
torch.manual_seed(0)
config = copy.deepcopy(original_config)
model = model_class(config)
model.to(torch_device)
model.eval()
hidden_states_no_chunk = model(**self._prepare_for_class(inputs_dict, model_class))[0]
torch.manual_seed(0)
config.chunk_size_feed_forward = 1
model = model_class(config)
model.to(torch_device)
model.eval()
hidden_states_with_chunk = model(**self._prepare_for_class(inputs_dict, model_class))[0]
if model_class.__name__ == "PerceiverForMultimodalAutoencoding":
# model outputs a dictionary with logits for each modality
for modality in hidden_states_no_chunk.keys():
self.assertTrue(
torch.allclose(hidden_states_no_chunk[modality], hidden_states_with_chunk[modality], atol=1e-3)
)
else:
self.assertTrue(torch.allclose(hidden_states_no_chunk, hidden_states_with_chunk, atol=1e-3))
def test_save_load(self):
for model_class in self.all_model_classes:
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
if model_class.__name__ == "PerceiverForMultimodalAutoencoding":
for modality in outputs[0].keys():
out_2 = outputs[0][modality].cpu().numpy()
out_2[np.isnan(out_2)] = 0
with tempfile.TemporaryDirectory() as tmpdirname:
model.save_pretrained(tmpdirname)
model = model_class.from_pretrained(tmpdirname)
model.to(torch_device)
with torch.no_grad():
after_outputs = model(**self._prepare_for_class(inputs_dict, model_class))
# Make sure we don't have nans
out_1 = after_outputs[0][modality].cpu().numpy()
out_1[np.isnan(out_1)] = 0
max_diff = np.amax(np.abs(out_1 - out_2))
self.assertLessEqual(max_diff, 1e-5)
else:
out_2 = outputs[0].cpu().numpy()
out_2[np.isnan(out_2)] = 0
with tempfile.TemporaryDirectory() as tmpdirname:
model.save_pretrained(tmpdirname)
model = model_class.from_pretrained(tmpdirname)
model.to(torch_device)
with torch.no_grad():
after_outputs = model(**self._prepare_for_class(inputs_dict, model_class))
# Make sure we don't have nans
out_1 = after_outputs[0].cpu().numpy()
out_1[np.isnan(out_1)] = 0
max_diff = np.amax(np.abs(out_1 - out_2))
self.assertLessEqual(max_diff, 1e-5)
def test_correct_missing_keys(self):
if not self.test_missing_keys:
return
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
# most Perceiver models don't have a typical head like BERT does
if model_class in [
PerceiverForOpticalFlow,
PerceiverForMultimodalAutoencoding,
*get_values(MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING),
*get_values(MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING),
]:
continue
model = model_class(config)
base_model_prefix = model.base_model_prefix
if hasattr(model, base_model_prefix):
with tempfile.TemporaryDirectory() as temp_dir_name:
model.base_model.save_pretrained(temp_dir_name)
model, loading_info = model_class.from_pretrained(temp_dir_name, output_loading_info=True)
with self.subTest(msg=f"Missing keys for {model.__class__.__name__}"):
self.assertGreater(len(loading_info["missing_keys"]), 0)
def test_problem_types(self):
problem_types = [
{"title": "multi_label_classification", "num_labels": 2, "dtype": torch.float},
{"title": "single_label_classification", "num_labels": 1, "dtype": torch.long},
{"title": "regression", "num_labels": 1, "dtype": torch.float},
]
for model_class in self.all_model_classes:
if model_class not in get_values(MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING):
continue
config, inputs, input_mask, _, _ = self.model_tester.prepare_config_and_inputs(model_class=model_class)
inputs_dict = dict(inputs=inputs, attention_mask=input_mask)
for problem_type in problem_types:
with self.subTest(msg=f"Testing {model_class} with {problem_type['title']}"):
config.problem_type = problem_type["title"]
config.num_labels = problem_type["num_labels"]
model = model_class(config)
model.to(torch_device)
model.train()
inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
if problem_type["num_labels"] > 1:
inputs["labels"] = inputs["labels"].unsqueeze(1).repeat(1, problem_type["num_labels"])
inputs["labels"] = inputs["labels"].to(problem_type["dtype"])
# This tests that we do not trigger the warning from PyTorch "Using a target size that is different
# to the input size. This will likely lead to incorrect results due to broadcasting. Please ensure
# they have the same size.", which is a symptom that something is wrong for the regression problem.
# See https://github.com/huggingface/transformers/issues/11780
with warnings.catch_warnings(record=True) as warning_list:
loss = model(**inputs).loss
for w in warning_list:
if "Using a target size that is different to the input size" in str(w.message):
raise ValueError(
f"Something is going wrong in the regression problem: intercepted {w.message}"
)
loss.backward()
@unittest.skip(reason="Perceiver models don't have a typical head like is the case with BERT")
def test_save_load_fast_init_from_base(self):
pass
@unittest.skip(reason="Perceiver models don't have a typical head like is the case with BERT")
def test_save_load_fast_init_to_base(self):
pass
@unittest.skip(reason="Perceiver doesn't support resize_token_embeddings")
def test_resize_tokens_embeddings(self):
pass
@unittest.skip(reason="Perceiver doesn't support resize_token_embeddings")
def test_resize_embeddings_untied(self):
pass
@unittest.skip(reason="Perceiver doesn't support inputs_embeds")
def test_inputs_embeds(self):
pass
@unittest.skip(reason="Perceiver doesn't support the AutoModel API")
def test_load_with_mismatched_shapes(self):
pass
@slow
def test_model_from_pretrained(self):
for model_name in PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = PerceiverModel.from_pretrained(model_name)
self.assertIsNotNone(model)
# We will verify our results on an image of cute cats
def prepare_img():
image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
return image
# Helper functions for optical flow integration test
def prepare_optical_flow_images():
dataset = load_dataset("hf-internal-testing/fixtures_sintel", split="test")
image1 = Image.open(dataset[0]["file"]).convert("RGB")
image2 = Image.open(dataset[0]["file"]).convert("RGB")
return image1, image2
def normalize(img):
return img / 255.0 * 2 - 1
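# maps uint8 pixel values in [0, 255] to floats in [-1, 1]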
def extract_image_patches(x, kernel, stride=1, dilation=1):
# Do TF 'SAME' Padding
b, c, h, w = x.shape
h2 = math.ceil(h / stride)
w2 = math.ceil(w / stride)
pad_row = (h2 - 1) * stride + (kernel - 1) * dilation + 1 - h
pad_col = (w2 - 1) * stride + (kernel - 1) * dilation + 1 - w
x = torch.nn.functional.pad(x, (pad_row // 2, pad_row - pad_row // 2, pad_col // 2, pad_col - pad_col // 2))
# Extract patches
patches = x.unfold(2, kernel, stride).unfold(3, kernel, stride)
patches = patches.permute(0, 4, 5, 1, 2, 3).contiguous()
return patches.view(b, -1, patches.shape[-2], patches.shape[-1])
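# Example (illustrative): a (1, 3, 368, 496) frame with kernel=3 and stride=1 yields (1, 27, 368, 496),
# since each 3x3 neighbourhood of the 3 input channels is flattened into 3 * 3 * 3 = 27 channels.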
@require_torch
@require_vision
class PerceiverModelIntegrationTest(unittest.TestCase):
@slow
def test_inference_masked_lm(self):
tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
model = PerceiverForMaskedLM.from_pretrained("deepmind/language-perceiver")
model.to(torch_device)
# prepare inputs
text = "This is an incomplete sentence where some words are missing."
encoding = tokenizer(text, padding="max_length", return_tensors="pt")
# mask " missing.".
encoding.input_ids[0, 52:61] = tokenizer.mask_token_id
inputs, input_mask = encoding.input_ids.to(torch_device), encoding.attention_mask.to(torch_device)
# forward pass
with torch.no_grad():
outputs = model(inputs=inputs, attention_mask=input_mask)
logits = outputs.logits
# verify logits
expected_shape = torch.Size((1, tokenizer.model_max_length, tokenizer.vocab_size))
self.assertEqual(logits.shape, expected_shape)
expected_slice = torch.tensor(
[[-10.8609, -10.7651, -10.9187], [-12.1689, -11.9389, -12.1479], [-12.1518, -11.9707, -12.2073]]
)
self.assertTrue(torch.allclose(logits[0, :3, :3], expected_slice, atol=1e-4))
expected_greedy_predictions = [38, 115, 111, 121, 121, 111, 116, 109, 52]
masked_tokens_predictions = logits[0, 52:61].argmax(dim=-1).tolist()
self.assertListEqual(expected_greedy_predictions, masked_tokens_predictions)
@slow
def test_inference_image_classification(self):
feature_extractor = PerceiverFeatureExtractor()
model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")
model.to(torch_device)
# prepare inputs
image = prepare_img()
inputs = feature_extractor(image, return_tensors="pt").pixel_values.to(torch_device)
input_mask = None
# forward pass
with torch.no_grad():
outputs = model(inputs=inputs, attention_mask=input_mask)
logits = outputs.logits
# verify logits
expected_shape = torch.Size((1, model.config.num_labels))
self.assertEqual(logits.shape, expected_shape)
expected_slice = torch.tensor([-1.1653, -0.1993, -0.7521], device=torch_device)
self.assertTrue(torch.allclose(logits[0, :3], expected_slice, atol=1e-4))
@slow
def test_inference_image_classification_fourier(self):
feature_extractor = PerceiverFeatureExtractor()
model = PerceiverForImageClassificationFourier.from_pretrained("deepmind/vision-perceiver-fourier")
model.to(torch_device)
# prepare inputs
image = prepare_img()
inputs = feature_extractor(image, return_tensors="pt").pixel_values.to(torch_device)
input_mask = None
# forward pass
with torch.no_grad():
outputs = model(inputs=inputs, attention_mask=input_mask)
logits = outputs.logits
# verify logits
expected_shape = torch.Size((1, model.config.num_labels))
self.assertEqual(logits.shape, expected_shape)
expected_slice = torch.tensor([-1.1295, -0.2832, 0.3226], device=torch_device)
self.assertTrue(torch.allclose(logits[0, :3], expected_slice, atol=1e-4))
@slow
def test_inference_image_classification_conv(self):
feature_extractor = PerceiverFeatureExtractor()
model = PerceiverForImageClassificationConvProcessing.from_pretrained("deepmind/vision-perceiver-conv")
model.to(torch_device)
# prepare inputs
image = prepare_img()
inputs = feature_extractor(image, return_tensors="pt").pixel_values.to(torch_device)
input_mask = None
# forward pass
with torch.no_grad():
outputs = model(inputs=inputs, attention_mask=input_mask)
logits = outputs.logits
# verify logits
expected_shape = torch.Size((1, model.config.num_labels))
self.assertEqual(logits.shape, expected_shape)
expected_slice = torch.tensor([-1.1186, 0.0554, 0.0897], device=torch_device)
self.assertTrue(torch.allclose(logits[0, :3], expected_slice, atol=1e-4))
@slow
def test_inference_optical_flow(self):
model = PerceiverForOpticalFlow.from_pretrained("deepmind/optical-flow-perceiver")
model.to(torch_device)
# prepare inputs
image1, image2 = prepare_optical_flow_images()
img1 = normalize(np.array(image1))
img2 = normalize(np.array(image2))
# stack images
img1 = torch.tensor(np.moveaxis(img1, -1, 0))
img2 = torch.tensor(np.moveaxis(img2, -1, 0))
images = torch.stack([img1, img2], dim=0)
# extract 3x3 patches
patch_size = model.config.train_size
inputs = images[..., : patch_size[0], : patch_size[1]].unsqueeze(0)
batch_size, _, C, H, W = inputs.shape
patches = extract_image_patches(inputs.view(batch_size * 2, C, H, W), kernel=3)
_, C, H, W = patches.shape
patches = patches.view(batch_size, -1, C, H, W).float()
# forward pass
with torch.no_grad():
outputs = model(inputs=patches)
logits = outputs.logits
# verify logits
expected_shape = torch.Size((1, 368, 496, 2))
self.assertEqual(logits.shape, expected_shape)
expected_slice = torch.tensor(
[
[[0.0025, -0.0050], [0.0025, -0.0049], [0.0025, -0.0048]],
[[0.0026, -0.0049], [0.0026, -0.0048], [0.0026, -0.0047]],
[[0.0026, -0.0049], [0.0026, -0.0048], [0.0026, -0.0046]],
],
device=torch_device,
)
self.assertTrue(torch.allclose(logits[0, :3, :3, :3], expected_slice, atol=1e-4))