Unverified Commit bb6f6d53 authored by NielsRogge's avatar NielsRogge Committed by GitHub
Browse files

Add X-CLIP (#18852)

* First draft

* Improve conversion script

* Make vision encoder work

* More improvements

* Improve conversion script

* Fix quality

* Add MultiframeIntegrationTransformer

* More improvements

* Make MiT output work

* Fix quality

* Add prompts generator

* Add tests

* Fix some tests

* Fix some more tests

* Fix more tests

* Improve conversion script

* Fix model outputs

* Fix more tests

* Add XClipProcessor

* Use processor in conversion script

* Fix integration test

* Update README, fix docs

* Fix all tests

* Add MIT output to XClipOutput

* Create better variable names

* Rename XClip to XCLIP

* Extend conversion script

* Add support for large models

* Add support for 16 frame models

* Add another model'

* Fix module issue

* Apply suggestions from code review

* Add figure to docs

* Fix CLIPProcessor issue

* Apply suggestions from code review

* Delete file

* Convert more checkpoints

* Convert last checkpoint

* Update nielsr to microsoft
parent 9832ac7c
...@@ -383,6 +383,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h ...@@ -383,6 +383,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli. 1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei. 1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[X-CLIP](https://huggingface.co/docs/transformers/main/model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li. 1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau. 1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. 1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
......
...@@ -335,6 +335,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는 ...@@ -335,6 +335,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli. 1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei. 1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[X-CLIP](https://huggingface.co/docs/transformers/main/model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li. 1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau. 1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. 1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
......
...@@ -359,6 +359,7 @@ conda install -c huggingface transformers ...@@ -359,6 +359,7 @@ conda install -c huggingface transformers
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (来自 Facebook AI) 伴随论文 [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) 由 Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino 发布。 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (来自 Facebook AI) 伴随论文 [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) 由 Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino 发布。
1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (来自 Facebook AI) 伴随论文 [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) 由 Qiantong Xu, Alexei Baevski, Michael Auli 发布。 1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (来自 Facebook AI) 伴随论文 [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) 由 Qiantong Xu, Alexei Baevski, Michael Auli 发布。
1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei. 1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[X-CLIP](https://huggingface.co/docs/transformers/main/model_doc/xclip)** (来自 Microsoft Research) 伴随论文 [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) 由 Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling 发布。
1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li. 1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (来自 Facebook) 伴随论文 [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) 由 Guillaume Lample and Alexis Conneau 发布。 1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (来自 Facebook) 伴随论文 [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) 由 Guillaume Lample and Alexis Conneau 发布。
1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (来自 Microsoft Research) 伴随论文 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 由 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 发布。 1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (来自 Microsoft Research) 伴随论文 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 由 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 发布。
......
...@@ -371,6 +371,7 @@ conda install -c huggingface transformers ...@@ -371,6 +371,7 @@ conda install -c huggingface transformers
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli. 1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei. 1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[X-CLIP](https://huggingface.co/docs/transformers/main/model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li. 1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau. 1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. 1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
......
...@@ -470,6 +470,8 @@ ...@@ -470,6 +470,8 @@
title: Vision Text Dual Encoder title: Vision Text Dual Encoder
- local: model_doc/visual_bert - local: model_doc/visual_bert
title: VisualBERT title: VisualBERT
- local: model_doc/xclip
title: X-CLIP
title: Multimodal models title: Multimodal models
- isExpanded: false - isExpanded: false
sections: sections:
......
...@@ -175,6 +175,7 @@ The documentation is organized into five sections: ...@@ -175,6 +175,7 @@ The documentation is organized into five sections:
1. **[Wav2Vec2-Conformer](model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. 1. **[Wav2Vec2-Conformer](model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1. **[Wav2Vec2Phoneme](model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli. 1. **[Wav2Vec2Phoneme](model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
1. **[WavLM](model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei. 1. **[WavLM](model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[X-CLIP](model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
1. **[XGLM](model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li. 1. **[XGLM](model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau. 1. **[XLM](model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
1. **[XLM-ProphetNet](model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. 1. **[XLM-ProphetNet](model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
...@@ -312,6 +313,7 @@ Flax), PyTorch, and/or TensorFlow. ...@@ -312,6 +313,7 @@ Flax), PyTorch, and/or TensorFlow.
| Wav2Vec2 | ✅ | ❌ | ✅ | ✅ | ✅ | | Wav2Vec2 | ✅ | ❌ | ✅ | ✅ | ✅ |
| Wav2Vec2-Conformer | ❌ | ❌ | ✅ | ❌ | ❌ | | Wav2Vec2-Conformer | ❌ | ❌ | ✅ | ❌ | ❌ |
| WavLM | ❌ | ❌ | ✅ | ❌ | ❌ | | WavLM | ❌ | ❌ | ✅ | ❌ | ❌ |
| X-CLIP | ❌ | ❌ | ✅ | ❌ | ❌ |
| XGLM | ✅ | ✅ | ✅ | ✅ | ✅ | | XGLM | ✅ | ✅ | ✅ | ✅ | ✅ |
| XLM | ✅ | ❌ | ✅ | ✅ | ❌ | | XLM | ✅ | ❌ | ✅ | ✅ | ❌ |
| XLM-ProphetNet | ✅ | ❌ | ✅ | ❌ | ❌ | | XLM-ProphetNet | ✅ | ❌ | ✅ | ❌ | ❌ |
......
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# X-CLIP
## Overview
The X-CLIP model was proposed in [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
X-CLIP is a minimal extension of [CLIP](clip) for video. The model consists of a text encoder, a cross-frame vision encoder, a multi-frame integration Transformer, and a video-specific prompt generator.
The abstract from the paper is the following:
*Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data, demonstrating remarkable "zero-shot" generalization ability for various image tasks. However, how to effectively expand such new language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. Such module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information for generating discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinectics-400, while using 12 times fewer FLOPs compared with Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when the labeled data is extremely limited.*
Tips:
- Usage of X-CLIP is identical to CLIP.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/xclip_architecture.png"
alt="drawing" width="600"/>
<small> X-CLIP architecture. Taken from the <a href="https://arxiv.org/abs/2208.02816">original paper.</a> </small>
This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/microsoft/VideoX/tree/master/X-CLIP).
## XCLIPProcessor
[[autodoc]] XCLIPProcessor
## XCLIPConfig
[[autodoc]] XCLIPConfig
- from_text_vision_configs
## XCLIPTextConfig
[[autodoc]] XCLIPTextConfig
## XCLIPVisionConfig
[[autodoc]] XCLIPVisionConfig
## XCLIPModel
[[autodoc]] XCLIPModel
- forward
- get_text_features
- get_video_features
## XCLIPTextModel
[[autodoc]] XCLIPTextModel
- forward
## XCLIPVisionModel
[[autodoc]] XCLIPVisionModel
- forward
...@@ -165,6 +165,7 @@ _import_structure = { ...@@ -165,6 +165,7 @@ _import_structure = {
"models.clip": [ "models.clip": [
"CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP", "CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP",
"CLIPConfig", "CLIPConfig",
"CLIPProcessor",
"CLIPTextConfig", "CLIPTextConfig",
"CLIPTokenizer", "CLIPTokenizer",
"CLIPVisionConfig", "CLIPVisionConfig",
...@@ -368,6 +369,13 @@ _import_structure = { ...@@ -368,6 +369,13 @@ _import_structure = {
"WAVLM_PRETRAINED_CONFIG_ARCHIVE_MAP", "WAVLM_PRETRAINED_CONFIG_ARCHIVE_MAP",
"WavLMConfig", "WavLMConfig",
], ],
"models.x_clip": [
"XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP",
"XCLIPConfig",
"XCLIPProcessor",
"XCLIPTextConfig",
"XCLIPVisionConfig",
],
"models.xglm": ["XGLM_PRETRAINED_CONFIG_ARCHIVE_MAP", "XGLMConfig"], "models.xglm": ["XGLM_PRETRAINED_CONFIG_ARCHIVE_MAP", "XGLMConfig"],
"models.xlm": ["XLM_PRETRAINED_CONFIG_ARCHIVE_MAP", "XLMConfig", "XLMTokenizer"], "models.xlm": ["XLM_PRETRAINED_CONFIG_ARCHIVE_MAP", "XLMConfig", "XLMTokenizer"],
"models.xlm_prophetnet": ["XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "XLMProphetNetConfig"], "models.xlm_prophetnet": ["XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "XLMProphetNetConfig"],
...@@ -641,7 +649,6 @@ else: ...@@ -641,7 +649,6 @@ else:
_import_structure["image_utils"] = ["ImageFeatureExtractionMixin"] _import_structure["image_utils"] = ["ImageFeatureExtractionMixin"]
_import_structure["models.beit"].append("BeitFeatureExtractor") _import_structure["models.beit"].append("BeitFeatureExtractor")
_import_structure["models.clip"].append("CLIPFeatureExtractor") _import_structure["models.clip"].append("CLIPFeatureExtractor")
_import_structure["models.clip"].append("CLIPProcessor")
_import_structure["models.convnext"].append("ConvNextFeatureExtractor") _import_structure["models.convnext"].append("ConvNextFeatureExtractor")
_import_structure["models.deit"].append("DeiTFeatureExtractor") _import_structure["models.deit"].append("DeiTFeatureExtractor")
_import_structure["models.detr"].append("DetrFeatureExtractor") _import_structure["models.detr"].append("DetrFeatureExtractor")
...@@ -988,6 +995,15 @@ else: ...@@ -988,6 +995,15 @@ else:
"CLIPVisionModel", "CLIPVisionModel",
] ]
) )
_import_structure["models.x_clip"].extend(
[
"XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST",
"XCLIPModel",
"XCLIPPreTrainedModel",
"XCLIPTextModel",
"XCLIPVisionModel",
]
)
_import_structure["models.convbert"].extend( _import_structure["models.convbert"].extend(
[ [
"CONVBERT_PRETRAINED_MODEL_ARCHIVE_LIST", "CONVBERT_PRETRAINED_MODEL_ARCHIVE_LIST",
...@@ -3012,6 +3028,7 @@ if TYPE_CHECKING: ...@@ -3012,6 +3028,7 @@ if TYPE_CHECKING:
from .models.clip import ( from .models.clip import (
CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP, CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP,
CLIPConfig, CLIPConfig,
CLIPProcessor,
CLIPTextConfig, CLIPTextConfig,
CLIPTokenizer, CLIPTokenizer,
CLIPVisionConfig, CLIPVisionConfig,
...@@ -3189,6 +3206,13 @@ if TYPE_CHECKING: ...@@ -3189,6 +3206,13 @@ if TYPE_CHECKING:
from .models.wav2vec2_phoneme import Wav2Vec2PhonemeCTCTokenizer from .models.wav2vec2_phoneme import Wav2Vec2PhonemeCTCTokenizer
from .models.wav2vec2_with_lm import Wav2Vec2ProcessorWithLM from .models.wav2vec2_with_lm import Wav2Vec2ProcessorWithLM
from .models.wavlm import WAVLM_PRETRAINED_CONFIG_ARCHIVE_MAP, WavLMConfig from .models.wavlm import WAVLM_PRETRAINED_CONFIG_ARCHIVE_MAP, WavLMConfig
from .models.x_clip import (
XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP,
XCLIPConfig,
XCLIPProcessor,
XCLIPTextConfig,
XCLIPVisionConfig,
)
from .models.xglm import XGLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XGLMConfig from .models.xglm import XGLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XGLMConfig
from .models.xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig, XLMTokenizer from .models.xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig, XLMTokenizer
from .models.xlm_prophetnet import XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMProphetNetConfig from .models.xlm_prophetnet import XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMProphetNetConfig
...@@ -3428,7 +3452,7 @@ if TYPE_CHECKING: ...@@ -3428,7 +3452,7 @@ if TYPE_CHECKING:
else: else:
from .image_utils import ImageFeatureExtractionMixin from .image_utils import ImageFeatureExtractionMixin
from .models.beit import BeitFeatureExtractor from .models.beit import BeitFeatureExtractor
from .models.clip import CLIPFeatureExtractor, CLIPProcessor from .models.clip import CLIPFeatureExtractor
from .models.convnext import ConvNextFeatureExtractor from .models.convnext import ConvNextFeatureExtractor
from .models.deit import DeiTFeatureExtractor from .models.deit import DeiTFeatureExtractor
from .models.detr import DetrFeatureExtractor from .models.detr import DetrFeatureExtractor
...@@ -4499,6 +4523,13 @@ if TYPE_CHECKING: ...@@ -4499,6 +4523,13 @@ if TYPE_CHECKING:
WavLMModel, WavLMModel,
WavLMPreTrainedModel, WavLMPreTrainedModel,
) )
from .models.x_clip import (
XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST,
XCLIPModel,
XCLIPPreTrainedModel,
XCLIPTextModel,
XCLIPVisionModel,
)
from .models.xglm import XGLM_PRETRAINED_MODEL_ARCHIVE_LIST, XGLMForCausalLM, XGLMModel, XGLMPreTrainedModel from .models.xglm import XGLM_PRETRAINED_MODEL_ARCHIVE_LIST, XGLMForCausalLM, XGLMModel, XGLMPreTrainedModel
from .models.xlm import ( from .models.xlm import (
XLM_PRETRAINED_MODEL_ARCHIVE_LIST, XLM_PRETRAINED_MODEL_ARCHIVE_LIST,
......
...@@ -151,6 +151,7 @@ from . import ( ...@@ -151,6 +151,7 @@ from . import (
wav2vec2_phoneme, wav2vec2_phoneme,
wav2vec2_with_lm, wav2vec2_with_lm,
wavlm, wavlm,
x_clip,
xglm, xglm,
xlm, xlm,
xlm_prophetnet, xlm_prophetnet,
......
...@@ -144,6 +144,7 @@ CONFIG_MAPPING_NAMES = OrderedDict( ...@@ -144,6 +144,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("wav2vec2", "Wav2Vec2Config"), ("wav2vec2", "Wav2Vec2Config"),
("wav2vec2-conformer", "Wav2Vec2ConformerConfig"), ("wav2vec2-conformer", "Wav2Vec2ConformerConfig"),
("wavlm", "WavLMConfig"), ("wavlm", "WavLMConfig"),
("xclip", "XCLIPConfig"),
("xglm", "XGLMConfig"), ("xglm", "XGLMConfig"),
("xlm", "XLMConfig"), ("xlm", "XLMConfig"),
("xlm-prophetnet", "XLMProphetNetConfig"), ("xlm-prophetnet", "XLMProphetNetConfig"),
...@@ -259,6 +260,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict( ...@@ -259,6 +260,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("vit_mae", "VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("vit_mae", "VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("wav2vec2", "WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("wav2vec2", "WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("wav2vec2-conformer", "WAV2VEC2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("wav2vec2-conformer", "WAV2VEC2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("xclip", "X_CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("xglm", "XGLM_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("xglm", "XGLM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("xlm", "XLM_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("xlm", "XLM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("xlm-prophetnet", "XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("xlm-prophetnet", "XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP"),
...@@ -408,6 +410,7 @@ MODEL_NAMES_MAPPING = OrderedDict( ...@@ -408,6 +410,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("wav2vec2-conformer", "Wav2Vec2-Conformer"), ("wav2vec2-conformer", "Wav2Vec2-Conformer"),
("wav2vec2_phoneme", "Wav2Vec2Phoneme"), ("wav2vec2_phoneme", "Wav2Vec2Phoneme"),
("wavlm", "WavLM"), ("wavlm", "WavLM"),
("xclip", "X-CLIP"),
("xglm", "XGLM"), ("xglm", "XGLM"),
("xlm", "XLM"), ("xlm", "XLM"),
("xlm-prophetnet", "XLM-ProphetNet"), ("xlm-prophetnet", "XLM-ProphetNet"),
...@@ -428,6 +431,7 @@ SPECIAL_MODEL_TYPE_TO_MODULE_NAME = OrderedDict( ...@@ -428,6 +431,7 @@ SPECIAL_MODEL_TYPE_TO_MODULE_NAME = OrderedDict(
("data2vec-text", "data2vec"), ("data2vec-text", "data2vec"),
("data2vec-vision", "data2vec"), ("data2vec-vision", "data2vec"),
("donut-swin", "donut"), ("donut-swin", "donut"),
("xclip", "x_clip"),
] ]
) )
......
...@@ -75,6 +75,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict( ...@@ -75,6 +75,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict(
("vit_mae", "ViTFeatureExtractor"), ("vit_mae", "ViTFeatureExtractor"),
("wav2vec2", "Wav2Vec2FeatureExtractor"), ("wav2vec2", "Wav2Vec2FeatureExtractor"),
("wav2vec2-conformer", "Wav2Vec2FeatureExtractor"), ("wav2vec2-conformer", "Wav2Vec2FeatureExtractor"),
("xclip", "CLIPFeatureExtractor"),
("yolos", "YolosFeatureExtractor"), ("yolos", "YolosFeatureExtractor"),
] ]
) )
......
...@@ -138,6 +138,7 @@ MODEL_MAPPING_NAMES = OrderedDict( ...@@ -138,6 +138,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("wav2vec2", "Wav2Vec2Model"), ("wav2vec2", "Wav2Vec2Model"),
("wav2vec2-conformer", "Wav2Vec2ConformerModel"), ("wav2vec2-conformer", "Wav2Vec2ConformerModel"),
("wavlm", "WavLMModel"), ("wavlm", "WavLMModel"),
("xclip", "XCLIPModel"),
("xglm", "XGLMModel"), ("xglm", "XGLMModel"),
("xlm", "XLMModel"), ("xlm", "XLMModel"),
("xlm-prophetnet", "XLMProphetNetModel"), ("xlm-prophetnet", "XLMProphetNetModel"),
......
...@@ -58,6 +58,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict( ...@@ -58,6 +58,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
("wav2vec2-conformer", "Wav2Vec2Processor"), ("wav2vec2-conformer", "Wav2Vec2Processor"),
("wav2vec2_with_lm", "Wav2Vec2ProcessorWithLM"), ("wav2vec2_with_lm", "Wav2Vec2ProcessorWithLM"),
("wavlm", "Wav2Vec2Processor"), ("wavlm", "Wav2Vec2Processor"),
("xclip", "CLIPProcessor"),
] ]
) )
......
...@@ -253,6 +253,7 @@ else: ...@@ -253,6 +253,7 @@ else:
("wav2vec2", ("Wav2Vec2CTCTokenizer", None)), ("wav2vec2", ("Wav2Vec2CTCTokenizer", None)),
("wav2vec2-conformer", ("Wav2Vec2CTCTokenizer", None)), ("wav2vec2-conformer", ("Wav2Vec2CTCTokenizer", None)),
("wav2vec2_phoneme", ("Wav2Vec2PhonemeCTCTokenizer", None)), ("wav2vec2_phoneme", ("Wav2Vec2PhonemeCTCTokenizer", None)),
("xclip", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)),
( (
"xglm", "xglm",
( (
......
...@@ -36,6 +36,7 @@ _import_structure = { ...@@ -36,6 +36,7 @@ _import_structure = {
"CLIPTextConfig", "CLIPTextConfig",
"CLIPVisionConfig", "CLIPVisionConfig",
], ],
"processing_clip": ["CLIPProcessor"],
"tokenization_clip": ["CLIPTokenizer"], "tokenization_clip": ["CLIPTokenizer"],
} }
...@@ -54,7 +55,6 @@ except OptionalDependencyNotAvailable: ...@@ -54,7 +55,6 @@ except OptionalDependencyNotAvailable:
pass pass
else: else:
_import_structure["feature_extraction_clip"] = ["CLIPFeatureExtractor"] _import_structure["feature_extraction_clip"] = ["CLIPFeatureExtractor"]
_import_structure["processing_clip"] = ["CLIPProcessor"]
try: try:
if not is_torch_available(): if not is_torch_available():
...@@ -108,6 +108,7 @@ if TYPE_CHECKING: ...@@ -108,6 +108,7 @@ if TYPE_CHECKING:
CLIPTextConfig, CLIPTextConfig,
CLIPVisionConfig, CLIPVisionConfig,
) )
from .processing_clip import CLIPProcessor
from .tokenization_clip import CLIPTokenizer from .tokenization_clip import CLIPTokenizer
try: try:
...@@ -125,7 +126,6 @@ if TYPE_CHECKING: ...@@ -125,7 +126,6 @@ if TYPE_CHECKING:
pass pass
else: else:
from .feature_extraction_clip import CLIPFeatureExtractor from .feature_extraction_clip import CLIPFeatureExtractor
from .processing_clip import CLIPProcessor
try: try:
if not is_torch_available(): if not is_torch_available():
......
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available
_import_structure = {
"configuration_x_clip": [
"XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP",
"XCLIPConfig",
"XCLIPTextConfig",
"XCLIPVisionConfig",
],
"processing_x_clip": ["XCLIPProcessor"],
}
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_x_clip"] = [
"XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST",
"XCLIPModel",
"XCLIPPreTrainedModel",
"XCLIPTextModel",
"XCLIPVisionModel",
]
if TYPE_CHECKING:
from .configuration_x_clip import (
XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP,
XCLIPConfig,
XCLIPTextConfig,
XCLIPVisionConfig,
)
from .processing_x_clip import XCLIPProcessor
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_x_clip import (
XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST,
XCLIPModel,
XCLIPPreTrainedModel,
XCLIPTextModel,
XCLIPVisionModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" X-CLIP model configuration"""
import copy
import os
from typing import Union
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"microsoft/xclip-base-patch32": "https://huggingface.co/microsoft/xclip-base-patch32/resolve/main/config.json",
}
class XCLIPTextConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`XCLIPModel`]. It is used to instantiate an X-CLIP
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the X-CLIP
[microsoft/xclip-base-patch32](https://huggingface.co/microsoft/xclip-base-patch32) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 49408):
Vocabulary size of the X-CLIP text model. Defines the number of different tokens that can be represented by
the `inputs_ids` passed when calling [`XCLIPModel`].
hidden_size (`int`, *optional*, defaults to 512):
Dimensionality of the encoder layers and the pooler layer.
intermediate_size (`int`, *optional*, defaults to 2048):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
num_hidden_layers (`int`, *optional*, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (`int`, *optional*, defaults to 8):
Number of attention heads for each attention layer in the Transformer encoder.
max_position_embeddings (`int`, *optional*, defaults to 77):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported.
layer_norm_eps (`float`, *optional*, defaults to 1e-5):
The epsilon used by the layer normalization layers.
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
dropout (`float`, *optional*, defaults to 0.0):
The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
initializer_factor (`float``, *optional*, defaults to 1):
A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
testing).
Example:
```python
>>> from transformers import XCLIPTextModel, XCLIPTextConfig
>>> # Initializing a XCLIPTextModel with microsoft/xclip-base-patch32 style configuration
>>> configuration = XCLIPTextConfig()
>>> # Initializing a XCLIPTextConfig from the microsoft/xclip-base-patch32 style configuration
>>> model = XCLIPTextModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "xclip_text_model"
def __init__(
self,
vocab_size=49408,
hidden_size=512,
intermediate_size=2048,
num_hidden_layers=12,
num_attention_heads=8,
max_position_embeddings=77,
hidden_act="quick_gelu",
layer_norm_eps=0.00001,
dropout=0.0,
attention_dropout=0.0,
initializer_range=0.02,
initializer_factor=1.0,
pad_token_id=1,
bos_token_id=0,
eos_token_id=2,
**kwargs
):
super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.dropout = dropout
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.max_position_embeddings = max_position_embeddings
self.layer_norm_eps = layer_norm_eps
self.hidden_act = hidden_act
self.initializer_range = initializer_range
self.initializer_factor = initializer_factor
self.attention_dropout = attention_dropout
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
# get the text config dict if we are loading from XCLIPConfig
if config_dict.get("model_type") == "xclip":
config_dict = config_dict["text_config"]
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
logger.warning(
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
)
return cls.from_dict(config_dict, **kwargs)
class XCLIPVisionConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`XCLIPModel`]. It is used to instantiate an X-CLIP
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the X-CLIP
[microsoft/xclip-base-patch32](https://huggingface.co/microsoft/xclip-base-patch32) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
hidden_size (`int`, *optional*, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
intermediate_size (`int`, *optional*, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
num_hidden_layers (`int`, *optional*, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (`int`, *optional*, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
mit_hidden_size (`int`, *optional*, defaults to 512):
Dimensionality of the encoder layers of the Multiframe Integration Transformer (MIT).
mit_intermediate_size (`int`, *optional*, defaults to 2048):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Multiframe Integration Transformer
(MIT).
mit_num_hidden_layers (`int`, *optional*, defaults to 1):
Number of hidden layers in the Multiframe Integration Transformer (MIT).
mit_num_attention_heads (`int`, *optional*, defaults to 8):
Number of attention heads for each attention layer in the Multiframe Integration Transformer (MIT).
image_size (`int`, *optional*, defaults to 224):
The size (resolution) of each image.
patch_size (`int`, *optional*, defaults to 32):
The size (resolution) of each patch.
hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"selu"`, `"gelu_new"` and ``"quick_gelu"` are supported.
layer_norm_eps (`float`, *optional*, defaults to 1e-5):
The epsilon used by the layer normalization layers.
dropout (`float`, *optional*, defaults to 0.0):
The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
initializer_factor (`float``, *optional*, defaults to 1):
A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
testing).
drop_path_rate (`float`, *optional*, defaults to 0.0):
Stochastic depth rate.
Example:
```python
>>> from transformers import XCLIPVisionModel, XCLIPVisionConfig
>>> # Initializing a XCLIPVisionModel with microsoft/xclip-base-patch32 style configuration
>>> configuration = XCLIPVisionConfig()
>>> # Initializing a XCLIPVisionModel model from the microsoft/xclip-base-patch32 style configuration
>>> model = XCLIPVisionModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "xclip_vision_model"
def __init__(
self,
hidden_size=768,
intermediate_size=3072,
num_hidden_layers=12,
num_attention_heads=12,
mit_hidden_size=512,
mit_intermediate_size=2048,
mit_num_hidden_layers=1,
mit_num_attention_heads=8,
num_channels=3,
image_size=224,
patch_size=32,
num_frames=8,
hidden_act="quick_gelu",
layer_norm_eps=0.00001,
dropout=0.0,
attention_dropout=0.0,
initializer_range=0.02,
initializer_factor=1.0,
drop_path_rate=0.0,
**kwargs
):
super().__init__(**kwargs)
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.dropout = dropout
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.mit_hidden_size = mit_hidden_size
self.mit_intermediate_size = mit_intermediate_size
self.mit_num_hidden_layers = mit_num_hidden_layers
self.mit_num_attention_heads = mit_num_attention_heads
self.num_channels = num_channels
self.patch_size = patch_size
self.num_frames = num_frames
self.image_size = image_size
self.initializer_range = initializer_range
self.initializer_factor = initializer_factor
self.attention_dropout = attention_dropout
self.layer_norm_eps = layer_norm_eps
self.hidden_act = hidden_act
self.drop_path_rate = drop_path_rate
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
# get the vision config dict if we are loading from XCLIPConfig
if config_dict.get("model_type") == "xclip":
config_dict = config_dict["vision_config"]
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
logger.warning(
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
)
return cls.from_dict(config_dict, **kwargs)
class XCLIPConfig(PretrainedConfig):
r"""
[`XCLIPConfig`] is the configuration class to store the configuration of a [`XCLIPModel`]. It is used to
instantiate X-CLIP model according to the specified arguments, defining the text model and vision model configs.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
text_config_dict (`dict`, *optional*):
Dictionary of configuration options used to initialize [`XCLIPTextConfig`].
vision_config_dict (`dict`, *optional*):
Dictionary of configuration options used to initialize [`XCLIPVisionConfig`].
projection_dim (`int`, *optional*, defaults to 512):
Dimentionality of text and vision projection layers.
prompt_layers (`int`, *optional*, defaults to 2):
Number of layers in the video specific prompt generator.
prompt_alpha (`float`, *optional*, defaults to 0.1):
Alpha value to use in the video specific prompt generator.
prompt_hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
The non-linear activation function (function or string) in the video specific prompt generator. If string,
`"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported.
prompt_num_attention_heads (`int`, *optional*, defaults to 8):
Number of attention heads in the cross-attention of the video specific prompt generator.
prompt_attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout probability for the attention layers in the video specific prompt generator.
prompt_projection_dropout (`float`, *optional*, defaults to 0.0):
The dropout probability for the projection layers in the video specific prompt generator.
logit_scale_init_value (`float`, *optional*, defaults to 2.6592):
The inital value of the *logit_scale* parameter. Default is used as per the original XCLIP implementation.
kwargs (*optional*):
Dictionary of keyword arguments.
"""
model_type = "xclip"
is_composition = True
def __init__(
self,
text_config_dict=None,
vision_config_dict=None,
projection_dim=512,
prompt_layers=2,
prompt_alpha=0.1,
prompt_hidden_act="quick_gelu",
prompt_num_attention_heads=8,
prompt_attention_dropout=0.0,
prompt_projection_dropout=0.0,
logit_scale_init_value=2.6592,
**kwargs
):
super().__init__(text_config_dict=text_config_dict, vision_config_dict=vision_config_dict, **kwargs)
if text_config_dict is None:
text_config_dict = {}
logger.info("text_config_dict is None. Initializing the XCLIPTextConfig with default values.")
if vision_config_dict is None:
vision_config_dict = {}
logger.info("vision_config_dict is None. initializing the XCLIPVisionConfig with default values.")
self.text_config = XCLIPTextConfig(**text_config_dict)
self.vision_config = XCLIPVisionConfig(**vision_config_dict)
self.projection_dim = projection_dim
self.prompt_layers = prompt_layers
self.prompt_alpha = prompt_alpha
self.prompt_hidden_act = prompt_hidden_act
self.prompt_num_attention_heads = prompt_num_attention_heads
self.prompt_attention_dropout = prompt_attention_dropout
self.prompt_projection_dropout = prompt_projection_dropout
self.logit_scale_init_value = logit_scale_init_value
self.initializer_factor = 1.0
@classmethod
def from_text_vision_configs(cls, text_config: XCLIPTextConfig, vision_config: XCLIPVisionConfig, **kwargs):
r"""
Instantiate a [`XCLIPConfig`] (or a derived class) from xclip text model configuration and xclip vision model
configuration.
Returns:
[`XCLIPConfig`]: An instance of a configuration object
"""
return cls(text_config_dict=text_config.to_dict(), vision_config_dict=vision_config.to_dict(), **kwargs)
def to_dict(self):
"""
Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
Returns:
`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,
"""
output = copy.deepcopy(self.__dict__)
output["text_config"] = self.text_config.to_dict()
output["vision_config"] = self.vision_config.to_dict()
output["model_type"] = self.__class__.model_type
return output
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import numpy as np
import torch
import gdown
from huggingface_hub import hf_hub_download
from transformers import (
CLIPTokenizer,
CLIPTokenizerFast,
VideoMAEFeatureExtractor,
XCLIPConfig,
XCLIPModel,
XCLIPProcessor,
XCLIPTextConfig,
XCLIPVisionConfig,
)
def get_xclip_config(model_name, num_frames):
text_config = XCLIPTextConfig()
# derive patch size from model name
start_idx = model_name.find("patch")
patch_size = int(model_name[start_idx + len("patch") : start_idx + len("patch") + 2])
vision_config = XCLIPVisionConfig(patch_size=patch_size, num_frames=num_frames)
if "large" in model_name:
text_config.hidden_size = 768
text_config.intermediate_size = 3072
text_config.num_attention_heads = 12
vision_config.hidden_size = 1024
vision_config.intermediate_size = 4096
vision_config.num_attention_heads = 16
vision_config.num_hidden_layers = 24
vision_config.mit_hidden_size = 768
vision_config.mit_intermediate_size = 3072
if model_name == "xclip-large-patch14-16-frames":
vision_config.image_size = 336
config = XCLIPConfig.from_text_vision_configs(text_config, vision_config)
if "large" in model_name:
config.projection_dim = 768
return config
def rename_key(name):
# text encoder
if name == "token_embedding.weight":
name = name.replace("token_embedding.weight", "text_model.embeddings.token_embedding.weight")
if name == "positional_embedding":
name = name.replace("positional_embedding", "text_model.embeddings.position_embedding.weight")
if "ln_1" in name:
name = name.replace("ln_1", "layer_norm1")
if "ln_2" in name:
name = name.replace("ln_2", "layer_norm2")
if "c_fc" in name:
name = name.replace("c_fc", "fc1")
if "c_proj" in name:
name = name.replace("c_proj", "fc2")
if name.startswith("transformer.resblocks"):
name = name.replace("transformer.resblocks", "text_model.encoder.layers")
if "attn.out_proj" in name and "message" not in name:
name = name.replace("attn.out_proj", "self_attn.out_proj")
if "ln_final" in name:
name = name.replace("ln_final", "text_model.final_layer_norm")
# visual encoder
if name == "visual.class_embedding":
name = name.replace("visual.class_embedding", "vision_model.embeddings.class_embedding")
if name == "visual.positional_embedding":
name = name.replace("visual.positional_embedding", "vision_model.embeddings.position_embedding.weight")
if name.startswith("visual.transformer.resblocks"):
name = name.replace("visual.transformer.resblocks", "vision_model.encoder.layers")
if "visual.conv1" in name:
name = name.replace("visual.conv1", "vision_model.embeddings.patch_embedding")
if "visual.ln_pre" in name:
name = name.replace("visual.ln_pre", "vision_model.pre_layernorm")
if "visual.ln_post" in name:
name = name.replace("visual.ln_post", "vision_model.post_layernorm")
if "visual.proj" in name:
name = name.replace("visual.proj", "visual_projection.weight")
if "text_projection" in name:
name = name.replace("text_projection", "text_projection.weight")
# things on top
if "prompts_visual_proj" in name:
name = name.replace("prompts_visual_proj", "prompts_visual_projection")
if "prompts_visual_ln" in name:
name = name.replace("prompts_visual_ln", "prompts_visual_layernorm")
# mit
if name == "mit.positional_embedding":
name = name.replace("positional", "position")
if name.startswith("mit.resblocks"):
name = name.replace("mit.resblocks", "mit.encoder.layers")
# prompts generator
if name.startswith("prompts_generator.norm"):
name = name.replace("prompts_generator.norm", "prompts_generator.layernorm")
return name
def convert_state_dict(orig_state_dict, config):
for key in orig_state_dict.copy().keys():
val = orig_state_dict.pop(key)
if "attn.in_proj" in key:
key_split = key.split(".")
if key.startswith("visual"):
layer_num = key_split[3]
dim = config.vision_config.hidden_size
if "message_attn" in key:
if "weight" in key:
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.message_attn.q_proj.weight"] = val[
:dim, :
]
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.message_attn.k_proj.weight"] = val[
dim : dim * 2, :
]
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.message_attn.v_proj.weight"] = val[
-dim:, :
]
else:
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.message_attn.q_proj.bias"] = val[
:dim
]
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.message_attn.k_proj.bias"] = val[
dim : dim * 2
]
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.message_attn.v_proj.bias"] = val[
-dim:
]
else:
if "weight" in key:
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.self_attn.q_proj.weight"] = val[
:dim, :
]
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.self_attn.k_proj.weight"] = val[
dim : dim * 2, :
]
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.self_attn.v_proj.weight"] = val[
-dim:, :
]
else:
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.self_attn.q_proj.bias"] = val[:dim]
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.self_attn.k_proj.bias"] = val[
dim : dim * 2
]
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.self_attn.v_proj.bias"] = val[-dim:]
elif key.startswith("mit"):
layer_num = key_split[2]
dim = config.vision_config.mit_hidden_size
if "weight" in key:
orig_state_dict[f"mit.encoder.layers.{layer_num}.self_attn.q_proj.weight"] = val[:dim, :]
orig_state_dict[f"mit.encoder.layers.{layer_num}.self_attn.k_proj.weight"] = val[dim : dim * 2, :]
orig_state_dict[f"mit.encoder.layers.{layer_num}.self_attn.v_proj.weight"] = val[-dim:, :]
else:
orig_state_dict[f"mit.encoder.layers.{layer_num}.self_attn.q_proj.bias"] = val[:dim]
orig_state_dict[f"mit.encoder.layers.{layer_num}.self_attn.k_proj.bias"] = val[dim : dim * 2]
orig_state_dict[f"mit.encoder.layers.{layer_num}.self_attn.v_proj.bias"] = val[-dim:]
else:
layer_num = key_split[2]
dim = config.text_config.hidden_size
if "weight" in key:
orig_state_dict[f"text_model.encoder.layers.{layer_num}.self_attn.q_proj.weight"] = val[:dim, :]
orig_state_dict[f"text_model.encoder.layers.{layer_num}.self_attn.k_proj.weight"] = val[
dim : dim * 2, :
]
orig_state_dict[f"text_model.encoder.layers.{layer_num}.self_attn.v_proj.weight"] = val[-dim:, :]
else:
orig_state_dict[f"text_model.encoder.layers.{layer_num}.self_attn.q_proj.bias"] = val[:dim]
orig_state_dict[f"text_model.encoder.layers.{layer_num}.self_attn.k_proj.bias"] = val[
dim : dim * 2
]
orig_state_dict[f"text_model.encoder.layers.{layer_num}.self_attn.v_proj.bias"] = val[-dim:]
else:
new_key_name = rename_key(key)
if new_key_name in ["visual_projection.weight", "text_projection.weight"]:
val = val.T
orig_state_dict[new_key_name] = val
return orig_state_dict
def prepare_video(num_frames):
if num_frames == 8:
filename = "eating_spaghetti_8_frames.npy"
elif num_frames == 16:
filename = "eating_spaghetti.npy"
elif num_frames == 32:
filename = "eating_spaghetti_32_frames.npy"
file = hf_hub_download(
repo_id="datasets/hf-internal-testing/spaghetti-video",
filename=filename,
)
video = np.load(file)
return list(video)
def convert_xclip_checkpoint(model_name, pytorch_dump_folder_path=None, push_to_hub=False):
model_to_url = {
# fully supervised kinetics-400 checkpoints
"xclip-base-patch32": "https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/k400_32_8.pth",
"xclip-base-patch32-16-frames": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/k400_32_16.pth"
),
"xclip-base-patch16": "https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/k400_16_8.pth",
"xclip-base-patch16-16-frames": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/k400_16_16.pth"
),
"xclip-large-patch14": "https://drive.google.com/u/0/uc?id=1NUOImq0o5DlQTST17iIP3vG7DgmHQuCx&amp;export=download&amp;confirm=t&amp;uuid=b26caedc-88e2-473e-830a-9d158b653cdb",
"xclip-large-patch14-16-frames": "https://drive.google.com/u/0/uc?id=1FOYgnJc097OJ4lGwtRCCydQyVPJEOH7d&amp;export=download&amp;confirm=t&amp;uuid=538fa810-e671-4050-b385-9a623f89804f",
# fully supervised kinetics-600 checkpoints
"xclip-base-patch16-kinetics-600": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/k600_16_8.pth"
),
"xclip-base-patch16-kinetics-600-16-frames": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/k600_16_16.pth"
),
"xclip-large-patch14-kinetics-600": "https://drive.google.com/u/0/uc?id=1FV8C1INuM91sLAN4ImjzePLIlpMSihwV&amp;export=download&amp;confirm=t&amp;uuid=141d4977-4a65-44ae-864f-4b0c19f838be",
# few shot
"xclip-base-patch16-hmdb-2-shot": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/few_hmdb_2.pth"
),
"xclip-base-patch16-hmdb-4-shot": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/few_hmdb_4.pth"
),
"xclip-base-patch16-hmdb-8-shot": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/few_hmdb_8.pth"
),
"xclip-base-patch16-hmdb-16-shot": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/few_hmdb_16.pth"
),
"xclip-base-patch16-ucf-2-shot": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/few_ucf_2.pth"
),
"xclip-base-patch16-ucf-4-shot": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/few_ucf_4.pth"
),
"xclip-base-patch16-ucf-8-shot": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/few_ucf_8.pth"
),
"xclip-base-patch16-ucf-16-shot": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/few_ucf_16.pth"
),
# zero shot
"xclip-base-patch16-zero-shot": "https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/zero.pth",
}
checkpoint_url = model_to_url[model_name]
num_frames = 8
if "16-frames" in model_name:
num_frames = 16
elif "shot" in model_name:
num_frames = 32
config = get_xclip_config(model_name, num_frames)
model = XCLIPModel(config)
model.eval()
if "drive" in checkpoint_url:
output = "pytorch_model.bin"
gdown.cached_download(checkpoint_url, output, quiet=False)
state_dict = torch.load(output, map_location="cpu")["model"]
else:
state_dict = torch.hub.load_state_dict_from_url(checkpoint_url)["model"]
state_dict = convert_state_dict(state_dict, config)
model = XCLIPModel(config)
missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)
assert missing_keys == ["text_model.embeddings.position_ids", "vision_model.embeddings.position_ids"]
model.eval()
size = 336 if model_name == "xclip-large-patch14-16-frames" else 224
feature_extractor = VideoMAEFeatureExtractor(size=size)
slow_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
fast_tokenizer = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch32")
processor = XCLIPProcessor(feature_extractor=feature_extractor, tokenizer=fast_tokenizer)
video = prepare_video(num_frames)
inputs = processor(
text=["playing sports", "eating spaghetti", "go shopping"], videos=video, return_tensors="pt", padding=True
)
print("Shape of pixel values:", inputs.pixel_values.shape)
with torch.no_grad():
outputs = model(**inputs)
# Verify outputs
logits_per_video = outputs.logits_per_video
probs = logits_per_video.softmax(dim=1)
print("Probs:", probs)
# kinetics-400
if model_name == "xclip-base-patch32":
expected_probs = torch.tensor([[0.0019, 0.9951, 0.0030]])
elif model_name == "xclip-base-patch32-16-frames":
expected_probs = torch.tensor([[7.0999e-04, 9.9883e-01, 4.5580e-04]])
elif model_name == "xclip-base-patch16":
expected_probs = torch.tensor([[0.0083, 0.9681, 0.0236]])
elif model_name == "xclip-base-patch16-16-frames":
expected_probs = torch.tensor([[7.6937e-04, 9.9728e-01, 1.9473e-03]])
elif model_name == "xclip-large-patch14":
expected_probs = torch.tensor([[0.0062, 0.9864, 0.0075]])
elif model_name == "xclip-large-patch14-16-frames":
expected_probs = torch.tensor([[3.3877e-04, 9.9937e-01, 2.8888e-04]])
# kinetics-600
elif model_name == "xclip-base-patch16-kinetics-600":
expected_probs = torch.tensor([[0.0555, 0.8914, 0.0531]])
elif model_name == "xclip-base-patch16-kinetics-600-16-frames":
expected_probs = torch.tensor([[3.8554e-04, 9.9929e-01, 3.2754e-04]])
elif model_name == "xclip-large-patch14-kinetics-600":
expected_probs = torch.tensor([[0.0036, 0.9920, 0.0045]])
# few shot
elif model_name == "xclip-base-patch16-hmdb-2-shot":
expected_probs = torch.tensor([[7.1890e-06, 9.9994e-01, 5.6559e-05]])
elif model_name == "xclip-base-patch16-hmdb-4-shot":
expected_probs = torch.tensor([[1.0320e-05, 9.9993e-01, 6.2435e-05]])
elif model_name == "xclip-base-patch16-hmdb-8-shot":
expected_probs = torch.tensor([[4.1377e-06, 9.9990e-01, 9.8386e-05]])
elif model_name == "xclip-base-patch16-hmdb-16-shot":
expected_probs = torch.tensor([[4.1347e-05, 9.9962e-01, 3.3411e-04]])
elif model_name == "xclip-base-patch16-ucf-2-shot":
expected_probs = torch.tensor([[8.5857e-05, 9.9928e-01, 6.3291e-04]])
elif model_name == "xclip-base-patch16-ucf-4-shot":
expected_probs = torch.tensor([[8.5857e-05, 9.9928e-01, 6.3291e-04]])
elif model_name == "xclip-base-patch16-ucf-8-shot":
expected_probs = torch.tensor([[0.0027, 0.9904, 0.0070]])
elif model_name == "xclip-base-patch16-ucf-16-shot":
expected_probs = torch.tensor([[9.8219e-04, 9.9593e-01, 3.0863e-03]])
# zero shot
elif model_name == "xclip-base-patch16-zero-shot":
expected_probs = torch.tensor([[3.5082e-04, 9.9785e-01, 1.7966e-03]])
else:
raise ValueError(f"Model name {model_name} not supported")
assert torch.allclose(probs, expected_probs, atol=1e-3)
print("Looks ok!")
if pytorch_dump_folder_path is not None:
print(f"Saving model {model_name} to {pytorch_dump_folder_path}")
model.save_pretrained(pytorch_dump_folder_path)
if push_to_hub:
print("Pushing model, processor and slow tokenizer files to the hub...")
model.push_to_hub(model_name, organization="nielsr")
processor.push_to_hub(model_name, organization="nielsr")
slow_tokenizer.push_to_hub(model_name, organization="nielsr")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--model_name",
default="xclip-base-patch32",
type=str,
help="Name of the model.",
)
parser.add_argument(
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
)
args = parser.parse_args()
convert_xclip_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.push_to_hub)
This diff is collapsed.
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Image/Text processor class for XCLIP
"""
from ...processing_utils import ProcessorMixin
from ...tokenization_utils_base import BatchEncoding
class XCLIPProcessor(ProcessorMixin):
r"""
Constructs an X-CLIP processor which wraps a VideoMAE feature extractor and a CLIP tokenizer into a single
processor.
[`XCLIPProcessor`] offers all the functionalities of [`VideoMAEFeatureExtractor`] and [`CLIPTokenizerFast`]. See
the [`~XCLIPProcessor.__call__`] and [`~XCLIPProcessor.decode`] for more information.
Args:
feature_extractor ([`VideoMAEFeatureExtractor`]):
The feature extractor is a required input.
tokenizer ([`CLIPTokenizerFast`]):
The tokenizer is a required input.
"""
feature_extractor_class = "VideoMAEFeatureExtractor"
tokenizer_class = ("CLIPTokenizer", "CLIPTokenizerFast")
def __init__(self, feature_extractor, tokenizer):
super().__init__(feature_extractor, tokenizer)
self.current_processor = self.feature_extractor
def __call__(self, text=None, videos=None, return_tensors=None, **kwargs):
"""
Main method to prepare for the model one or several sequences(s) and image(s). This method forwards the `text`
and `kwargs` arguments to CLIPTokenizerFast's [`~CLIPTokenizerFast.__call__`] if `text` is not `None` to encode
the text. To prepare the image(s), this method forwards the `videos` and `kwargs` arguments to
VideoMAEFeatureExtractor's [`~VideoMAEFeatureExtractor.__call__`] if `videos` is not `None`. Please refer to
the doctsring of the above two methods for more information.
Args:
text (`str`, `List[str]`, `List[List[str]]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
videos (`List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`, `List[List[PIL.Image.Image]]`, `List[List[np.ndarrray]]`,:
`List[List[torch.Tensor]]`): The video or batch of videos to be prepared. Each video should be a list
of frames, which can be either PIL images or NumPy arrays. In case of NumPy arrays/PyTorch tensors,
each frame should be of shape (H, W, C), where H and W are frame height and width, and C is a number of
channels.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchEncoding`]: A [`BatchEncoding`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
`None`).
- **pixel_values** -- Pixel values to be fed to a model. Returned when `videos` is not `None`.
"""
if text is None and videos is None:
raise ValueError("You have to specify either text or videos. Both cannot be none.")
if text is not None:
encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs)
if videos is not None:
image_features = self.feature_extractor(videos, return_tensors=return_tensors, **kwargs)
if text is not None and videos is not None:
encoding["pixel_values"] = image_features.pixel_values
return encoding
elif text is not None:
return encoding
else:
return BatchEncoding(data=dict(**image_features), tensor_type=return_tensors)
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to CLIPTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
refer to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to CLIPTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment