Add CLVP (#24745)

* init commit
* attention arch done except rotary emb
* rotary emb done
* text encoder working
* outputs matching
* arch first pass done
* make commands done, tests and docs remaining
* all tests passed, only docs remaining
* docs done
* doc-builder fix
* convert script removed (not relevant)
* minor comments done
* added ckpt conversion script
* tokenizer done
* very minor fix of index.md 2
* mostly make fixup related
* all done except fe and rotary emb
* very small change
* removed unidecode dependency
* style changes
* tokenizer removed require_backends
* added require_inflect to tokenizer tests
* removed VOCAB_FILES in tokenizer test
* inflect dependency removed
* added rotary pos emb cache and simplified the apply method
* style
* little doc change
* more comments
* feature extractor added
* added processor
* auto-regressive config added
* added CLVPConditioningEncoder
* comments done except the test one
* weights added successfully (NOT tested)
* tokenizer fix with numbers
* generate outputs matching
* almost tests passing, Integ tests not written
* Integ tests added
* major CUDA error fixed
* docs done
* rebase and multiple fixes
* fixed rebase overwrites
* generate code simplified and tests for AutoRegressive model added
* minor changes
* refactored gpt2 code in clvp file
* weights done and all code refactored
* mostly done except the fast_tokenizer
* doc test fix
* config file's doc fixes
* more config fix
* more comments
* tokenizer comments mostly done
* modeling file mostly refactored and can load modules
* ClvpEncoder tested
* ClvpDecoder, ClvpModel and ClvpForCausalLM tested
* integration and all tests passed
* more fixes
* docs almost done
* ckpt conversion refactored
* style and some failing tests fix
* comments
* temporary output fix but test_assisted_decoding_matches_greedy_search test fails
* majority changes done
* use_cache outputs same now! Along with the assisted_greedy_decoding test fix
* more comments
* more comments
* prepare_inputs_for_generation fixed and _prepare_model_inputs added
* style fix
* clvp.md change
* moved clvpconditionalencoder norms
* add model to new index
* added tokenizer input_ids_with_special_tokens
* small fix
* config mostly done
* added config-tester and changed conversion script
* more comments
* comments
* style fix
* some comments
* tokenizer changed back to prev state
* small comments
* added output hidden states for the main model
* style fix
* comments
* small change
* revert small change
* .
* Update clvp.md
* Update test_modeling_clvp.py
* :)
* some minor change
* new fixes
* remove to_dict from FE

7e9f10ac · Susnato Dhar · GitHub · 9dd58c53 · 7e9f10ac · 7e9f10ac
Unverified Commit 7e9f10ac authored Nov 10, 2023 by Susnato Dhar Committed by GitHub Nov 10, 2023
20 changed files
--- a/README.md
+++ b/README.md
@@ -321,6 +321,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
 1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
+1. **[CLVP](https://huggingface.co/docs/transformers/main/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker.
 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
 1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve.
 1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang.

--- a/README_es.md
+++ b/README_es.md
@@ -296,6 +296,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
 1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
+1. **[CLVP](https://huggingface.co/docs/transformers/main/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker. 
 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
 1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve.
 1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang.

--- a/README_hd.md
+++ b/README_hd.md
@@ -270,6 +270,7 @@ conda install -c huggingface transformers
 1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (LAION-AI से) Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. द्वाराअनुसंधान पत्र [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) के साथ जारी किया गया
 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (OpenAI से) साथ वाला पेपर [लर्निंग ट्रांसफरेबल विजुअल मॉडल फ्रॉम नेचुरल लैंग्वेज सुपरविजन](https://arxiv.org /abs/2103.00020) एलेक रैडफोर्ड, जोंग वूक किम, क्रिस हैलासी, आदित्य रमेश, गेब्रियल गोह, संध्या अग्रवाल, गिरीश शास्त्री, अमांडा एस्केल, पामेला मिश्किन, जैक क्लार्क, ग्रेचेन क्रुएगर, इल्या सुत्स्केवर द्वारा।
 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
+1. **[CLVP](https://huggingface.co/docs/transformers/main/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker. 
 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (सेल्सफोर्स से) साथ में पेपर [प्रोग्राम सिंथेसिस के लिए एक संवादात्मक प्रतिमान](https://arxiv.org/abs/2203.13474) एरिक निजकैंप, बो पैंग, हिरोआकी हयाशी, लिफू तू, हुआन वांग, यिंगबो झोउ, सिल्वियो सावरेस, कैमिंग जिओंग रिलीज।
 1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (MetaAI से) Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve. द्वाराअनुसंधान पत्र [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) के साथ जारी किया गया
 1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (माइक्रोसॉफ्ट रिसर्च एशिया से) कागज के साथ [फास्ट ट्रेनिंग कन्वर्जेंस के लिए सशर्त डीईटीआर](https://arxiv. org/abs/2108.06152) डेपू मेंग, ज़ियाओकांग चेन, ज़ेजिया फैन, गैंग ज़ेंग, होउकियांग ली, युहुई युआन, लेई सन, जिंगडोंग वांग द्वारा।

--- a/README_ja.md
+++ b/README_ja.md
@@ -330,6 +330,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
 1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (LAION-AI から) Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. から公開された研究論文 [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687)
 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (OpenAI から) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever から公開された研究論文: [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)
 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (University of Göttingen から) Timo Lüddecke and Alexander Ecker から公開された研究論文: [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003)
+1. **[CLVP](https://huggingface.co/docs/transformers/main/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker. 
 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (Salesforce から) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong から公開された研究論文: [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474)
 1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (MetaAI から) Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve. から公開された研究論文 [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/)
 1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (Microsoft Research Asia から) Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang から公開された研究論文: [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152)

--- a/README_ko.md
+++ b/README_ko.md
@@ -245,6 +245,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
 1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (LAION-AI 에서 제공)은 Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.의 [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687)논문과 함께 발표했습니다.
 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (OpenAI 에서) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever 의 [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) 논문과 함께 발표했습니다.
 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (University of Göttingen 에서) Timo Lüddecke and Alexander Ecker 의 [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) 논문과 함께 발표했습니다.
+1. **[CLVP](https://huggingface.co/docs/transformers/main/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker. 
 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (Salesforce 에서) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong 의 [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) 논문과 함께 발표했습니다.
 1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (MetaAI 에서 제공)은 Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve.의 [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/)논문과 함께 발표했습니다.
 1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (Microsoft Research Asia 에서) Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang 의 [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) 논문과 함께 발표했습니다.

--- a/README_zh-hans.md
+++ b/README_zh-hans.md
@@ -269,6 +269,7 @@ conda install -c huggingface transformers
 1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (来自 LAION-AI) 伴随论文 [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) 由 Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov 发布。
 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (来自 OpenAI) 伴随论文 [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) 由 Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever 发布。
 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (来自 University of Göttingen) 伴随论文 [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) 由 Timo Lüddecke and Alexander Ecker 发布。
+1. **[CLVP](https://huggingface.co/docs/transformers/main/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker. 
 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (来自 Salesforce) 伴随论文 [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) 由 Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong 发布。
 1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (来自 MetaAI) 伴随论文 [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) 由 Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve 发布。
 1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (来自 Microsoft Research Asia) 伴随论文 [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) 由 Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang 发布。

--- a/README_zh-hant.md
+++ b/README_zh-hant.md
@@ -281,6 +281,7 @@ conda install -c huggingface transformers
 1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
+1. **[CLVP](https://huggingface.co/docs/transformers/main/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker. 
 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
 1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve.
 1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang.

--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -663,6 +663,8 @@
        title: CLIP
      - local: model_doc/clipseg
        title: CLIPSeg
+      - local: model_doc/clvp
+        title: CLVP
      - local: model_doc/data2vec
        title: Data2Vec
      - local: model_doc/deplot

--- a/docs/source/en/index.md
+++ b/docs/source/en/index.md
@@ -92,6 +92,7 @@ Flax), PyTorch, and/or TensorFlow.
 |                          [CLAP](model_doc/clap)                          |       ✅        |         ❌         |      ❌      |
 |                          [CLIP](model_doc/clip)                          |       ✅        |         ✅         |      ✅      |
 |                       [CLIPSeg](model_doc/clipseg)                       |       ✅        |         ❌         |      ❌      |
+|                          [CLVP](model_doc/clvp)                          |       ✅        |         ❌         |      ❌      |
 |                       [CodeGen](model_doc/codegen)                       |       ✅        |         ❌         |      ❌      |
 |                    [CodeLlama](model_doc/code_llama)                     |       ✅        |         ❌         |      ❌      |
 |              [Conditional DETR](model_doc/conditional_detr)              |       ✅        |         ❌         |      ❌      |

--- a/docs/source/en/model_doc/clvp.md
+++ b/docs/source/en/model_doc/clvp.md
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+# CLVP
+## Overview
+The CLVP (Contrastive Language-Voice Pretrained Transformer) model was proposed in [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker.
+The abstract from the paper is the following:
+*In recent years, the field of image generation has been revolutionized by the application of autoregressive transformers and DDPMs. These approaches model the process of image generation as a step-wise probabilistic processes and leverage large amounts of compute and data to learn the image distribution. This methodology of improving performance need not be confined to images. This paper describes a way to apply advances in the image generative domain to speech synthesis. The result is TorToise - an expressive, multi-voice text-to-speech system.*
+This model was contributed by [Susnato Dhar](https://huggingface.co/susnato).
+The original code can be found [here](https://github.com/neonbjb/tortoise-tts).
+## Usage tips
+1. CLVP is an integral part of the Tortoise TTS model.
+2. CLVP can be used to compare different generated speech candidates with the provided text, and the best speech tokens are forwarded to the diffusion model.
+3. The use of the [`ClvpModelForConditionalGeneration.generate()`] method is strongly recommended for Tortoise usage.
+4. Note that the CLVP model expects the audio to be sampled at 22.05 kHz, unlike other audio models, which expect 16 kHz.
+## Brief Explanation:
+- The [`ClvpTokenizer`] tokenizes the text input, and the [`ClvpFeatureExtractor`] extracts the log mel-spectrogram from the desired audio.
+- [`ClvpConditioningEncoder`] takes those text tokens and audio representations and converts them into embeddings conditioned on the text and audio.
+- The [`ClvpForCausalLM`] uses those embeddings to generate multiple speech candidates.
+- Each speech candidate is passed through the speech encoder ([`ClvpEncoder`]), which converts it into a vector representation, and the text encoder ([`ClvpEncoder`]) converts the text tokens into the same latent space.
+- At the end, we compare each speech vector with the text vector to see which speech vector is most similar to the text vector. 
+- [`ClvpModelForConditionalGeneration.generate()`] compresses all of the logic described above into a single method.  
+Example:
+```python
+>>> import datasets
+>>> from transformers import ClvpProcessor, ClvpModelForConditionalGeneration
+>>> # Define the Text and Load the Audio (We are taking an audio example from HuggingFace Hub using `datasets` library).
+>>> text = "This is an example text."
+>>> ds = datasets.load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+>>> ds = ds.cast_column("audio", datasets.Audio(sampling_rate=22050))
+>>> sample = ds[0]["audio"]
+>>> # Define processor and model.
+>>> processor = ClvpProcessor.from_pretrained("susnato/clvp_dev")
+>>> model = ClvpModelForConditionalGeneration.from_pretrained("susnato/clvp_dev")
+>>> # Generate processor output and model output.
+>>> processor_output = processor(raw_speech=sample["array"], sampling_rate=sample["sampling_rate"], text=text, return_tensors="pt")
+>>> generated_output = model.generate(**processor_output)
+```
+## ClvpConfig
+[[autodoc]] ClvpConfig
+    - from_sub_model_configs
+## ClvpEncoderConfig
+[[autodoc]] ClvpEncoderConfig
+## ClvpDecoderConfig
+[[autodoc]] ClvpDecoderConfig
+## ClvpTokenizer
+[[autodoc]] ClvpTokenizer
+    - save_vocabulary
+## ClvpFeatureExtractor
+[[autodoc]] ClvpFeatureExtractor
+    - __call__
+## ClvpProcessor
+[[autodoc]] ClvpProcessor
+    - __call__
+    - decode
+    - batch_decode
+## ClvpModelForConditionalGeneration
+[[autodoc]] ClvpModelForConditionalGeneration
+    - forward
+    - generate
+    - get_text_features
+    - get_speech_features
+## ClvpForCausalLM
+[[autodoc]] ClvpForCausalLM
+## ClvpModel
+[[autodoc]] ClvpModel
+## ClvpEncoder
+[[autodoc]] ClvpEncoder
+## ClvpDecoder
+[[autodoc]] ClvpDecoder
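
Note on the "Brief Explanation" in `clvp.md`: the last step there (compare each speech vector with the text vector and keep the most similar one) can be sketched in a few lines. This is a minimal illustration, not the model's code. Cosine similarity is an assumption (the doc only says "most similar"), and `cosine_similarity`, `pick_best_candidate`, and the toy vectors are hypothetical names and data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_best_candidate(text_vector, speech_vectors):
    """Return the index of the speech vector closest to the text vector, plus all scores.

    Assumption: similarity is measured with cosine similarity.
    """
    scores = [cosine_similarity(text_vector, s) for s in speech_vectors]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Toy embeddings standing in for encoder outputs (made-up numbers).
text_vector = [1.0, 0.0, 1.0]
speech_vectors = [
    [0.0, 1.0, 0.0],    # orthogonal to the text vector
    [0.9, 0.1, 1.1],    # nearly parallel to the text vector
    [-1.0, 0.0, -1.0],  # opposite direction
]
best_index, scores = pick_best_candidate(text_vector, speech_vectors)
```

In this toy setup the second candidate wins, since it points in almost the same direction as the text vector.
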
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -256,6 +256,15 @@ _import_structure = {
        "CLIPSegTextConfig",
        "CLIPSegVisionConfig",
    ],
+    "models.clvp": [
+        "CLVP_PRETRAINED_CONFIG_ARCHIVE_MAP",
+        "ClvpConfig",
+        "ClvpDecoderConfig",
+        "ClvpEncoderConfig",
+        "ClvpFeatureExtractor",
+        "ClvpProcessor",
+        "ClvpTokenizer",
+    ],
    "models.code_llama": [],
    "models.codegen": ["CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP", "CodeGenConfig", "CodeGenTokenizer"],
    "models.conditional_detr": ["CONDITIONAL_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP", "ConditionalDetrConfig"],
@@ -1458,6 +1467,17 @@ else:
            "CLIPSegVisionModel",
        ]
    )
+    _import_structure["models.clvp"].extend(
+        [
+            "CLVP_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "ClvpDecoder",
+            "ClvpEncoder",
+            "ClvpForCausalLM",
+            "ClvpModel",
+            "ClvpModelForConditionalGeneration",
+            "ClvpPreTrainedModel",
+        ]
+    )
    _import_structure["models.codegen"].extend(
        [
            "CODEGEN_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -4446,6 +4466,15 @@ if TYPE_CHECKING:
        CLIPSegTextConfig,
        CLIPSegVisionConfig,
    )
+    from .models.clvp import (
+        CLVP_PRETRAINED_CONFIG_ARCHIVE_MAP,
+        ClvpConfig,
+        ClvpDecoderConfig,
+        ClvpEncoderConfig,
+        ClvpFeatureExtractor,
+        ClvpProcessor,
+        ClvpTokenizer,
+    )
    from .models.codegen import CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP, CodeGenConfig, CodeGenTokenizer
    from .models.conditional_detr import CONDITIONAL_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP, ConditionalDetrConfig
    from .models.convbert import CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, ConvBertConfig, ConvBertTokenizer
@@ -5516,6 +5545,15 @@ if TYPE_CHECKING:
            CLIPSegTextModel,
            CLIPSegVisionModel,
        )
+        from .models.clvp import (
+            CLVP_PRETRAINED_MODEL_ARCHIVE_LIST,
+            ClvpDecoder,
+            ClvpEncoder,
+            ClvpForCausalLM,
+            ClvpModel,
+            ClvpModelForConditionalGeneration,
+            ClvpPreTrainedModel,
+        )
        from .models.codegen import (
            CODEGEN_PRETRAINED_MODEL_ARCHIVE_LIST,
            CodeGenForCausalLM,
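
Note on the `__init__.py` hunks: the CLVP names are registered twice, once as strings in `_import_structure` and once as real imports under `TYPE_CHECKING`. The string registry lets the package defer the actual import until a name is first accessed. The sketch below shows that idea with a simplified, hypothetical `LazyModule` class; it is not the library's implementation, and it uses standard-library modules in place of `models.clvp` so that it runs on its own.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Hypothetical stand-in: resolves registered names on first attribute access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Invert {"submodule": ["Name", ...]} into {"Name": "submodule"}.
        self._name_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }

    def __getattr__(self, attr):
        # Only reached when normal lookup fails, i.e. before the name is cached.
        name_to_module = self.__dict__.get("_name_to_module", {})
        if attr not in name_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(name_to_module[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


# Names that are always available (analogous to configs and tokenizers).
import_structure = {"json": ["dumps"], "collections": ["OrderedDict"]}
# Names added conditionally (analogous to the `.extend([...])` call for model classes).
import_structure["json"].extend(["loads"])

lazy = LazyModule("demo", import_structure)
```

Nothing is resolved when `lazy` is created; `lazy.dumps` triggers the lookup and caches the result on the module object.
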

--- a/src/transformers/models/__init__.py
+++ b/src/transformers/models/__init__.py
@@ -46,6 +46,7 @@ from . import (
    clap,
    clip,
    clipseg,
+    clvp,
    code_llama,
    codegen,
    conditional_detr,

--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -57,6 +57,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
        ("clap", "ClapConfig"),
        ("clip", "CLIPConfig"),
        ("clipseg", "CLIPSegConfig"),
+        ("clvp", "ClvpConfig"),
        ("code_llama", "LlamaConfig"),
        ("codegen", "CodeGenConfig"),
        ("conditional_detr", "ConditionalDetrConfig"),
@@ -276,6 +277,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
        ("clap", "CLAP_PRETRAINED_MODEL_ARCHIVE_LIST"),
        ("clip", "CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("clipseg", "CLIPSEG_PRETRAINED_CONFIG_ARCHIVE_MAP"),
+        ("clvp", "CLVP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("codegen", "CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("conditional_detr", "CONDITIONAL_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("convbert", "CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -481,6 +483,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
        ("clap", "CLAP"),
        ("clip", "CLIP"),
        ("clipseg", "CLIPSeg"),
+        ("clvp", "CLVP"),
        ("code_llama", "CodeLlama"),
        ("codegen", "CodeGen"),
        ("conditional_detr", "Conditional DETR"),

--- a/src/transformers/models/auto/feature_extraction_auto.py
+++ b/src/transformers/models/auto/feature_extraction_auto.py
@@ -44,6 +44,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict(
        ("clap", "ClapFeatureExtractor"),
        ("clip", "CLIPFeatureExtractor"),
        ("clipseg", "ViTFeatureExtractor"),
+        ("clvp", "ClvpFeatureExtractor"),
        ("conditional_detr", "ConditionalDetrFeatureExtractor"),
        ("convnext", "ConvNextFeatureExtractor"),
        ("cvt", "ConvNextFeatureExtractor"),

--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -55,6 +55,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
        ("clap", "ClapModel"),
        ("clip", "CLIPModel"),
        ("clipseg", "CLIPSegModel"),
+        ("clvp", "ClvpModelForConditionalGeneration"),
        ("code_llama", "LlamaModel"),
        ("codegen", "CodeGenModel"),
        ("conditional_detr", "ConditionalDetrModel"),

--- a/src/transformers/models/auto/processing_auto.py
+++ b/src/transformers/models/auto/processing_auto.py
@@ -53,6 +53,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
        ("clap", "ClapProcessor"),
        ("clip", "CLIPProcessor"),
        ("clipseg", "CLIPSegProcessor"),
+        ("clvp", "ClvpProcessor"),
        ("flava", "FlavaProcessor"),
        ("fuyu", "FuyuProcessor"),
        ("git", "GitProcessor"),

--- a/src/transformers/models/auto/tokenization_auto.py
+++ b/src/transformers/models/auto/tokenization_auto.py
@@ -121,6 +121,7 @@ else:
                    "CLIPTokenizerFast" if is_tokenizers_available() else None,
                ),
            ),
+            ("clvp", ("ClvpTokenizer", None)),
            (
                "code_llama",
                (

--- a/src/transformers/models/clvp/__init__.py
+++ b/src/transformers/models/clvp/__init__.py
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+from ...utils import (
+    OptionalDependencyNotAvailable,
+    _LazyModule,
+    is_torch_available,
+)
+_import_structure = {
+    "configuration_clvp": [
+        "CLVP_PRETRAINED_CONFIG_ARCHIVE_MAP",
+        "ClvpConfig",
+        "ClvpDecoderConfig",
+        "ClvpEncoderConfig",
+    ],
+    "feature_extraction_clvp": ["ClvpFeatureExtractor"],
+    "processing_clvp": ["ClvpProcessor"],
+    "tokenization_clvp": ["ClvpTokenizer"],
+}
+try:
+    if not is_torch_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_clvp"] = [
+        "CLVP_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "ClvpModelForConditionalGeneration",
+        "ClvpForCausalLM",
+        "ClvpModel",
+        "ClvpPreTrainedModel",
+        "ClvpEncoder",
+        "ClvpDecoder",
+    ]
+if TYPE_CHECKING:
+    from .configuration_clvp import (
+        CLVP_PRETRAINED_CONFIG_ARCHIVE_MAP,
+        ClvpConfig,
+        ClvpDecoderConfig,
+        ClvpEncoderConfig,
+    )
+    from .feature_extraction_clvp import ClvpFeatureExtractor
+    from .processing_clvp import ClvpProcessor
+    from .tokenization_clvp import ClvpTokenizer
+    try:
+        if not is_torch_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_clvp import (
+            CLVP_PRETRAINED_MODEL_ARCHIVE_LIST,
+            ClvpDecoder,
+            ClvpEncoder,
+            ClvpForCausalLM,
+            ClvpModel,
+            ClvpModelForConditionalGeneration,
+            ClvpPreTrainedModel,
+        )
+else:
+    import sys
+    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
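Note on the `_import_structure` / `_LazyModule` pattern used in the file above: the dictionary maps each submodule to the names it exports, so that a submodule is imported only when one of its names is first accessed. The following is a minimal sketch of that idea only, not the actual `_LazyModule` implementation from `transformers`. The class `LazyNamespace` is hypothetical, and stdlib modules stand in for `configuration_clvp`, `modeling_clvp`, and the other CLVP submodules.

```python
import importlib
import types


class LazyNamespace(types.ModuleType):
    """Hypothetical stand-in: resolves exported names on first access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Invert {module: [names]} into {name: module} for quick lookup.
        self._attr_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails.
        mapping = self.__dict__.get("_attr_to_module", {})
        if attr not in mapping:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(mapping[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so the import happens once
        return value


# Assumption: stdlib modules play the role of the CLVP submodules here.
lazy = LazyNamespace("demo", {"math": ["sqrt"], "json": ["dumps"]})
root = lazy.sqrt(9)  # "math" is imported only at this point
```

The `TYPE_CHECKING` branch in the real file exists so that static type checkers still see ordinary imports, while the runtime path goes through the lazy module.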
--- a/src/transformers/models/clvp/configuration_clvp.py
+++ b/src/transformers/models/clvp/configuration_clvp.py
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" CLVP model configuration"""
+import os
+from typing import TYPE_CHECKING, Union
+if TYPE_CHECKING:
+    pass
+from ...configuration_utils import PretrainedConfig
+from ...utils import logging
+logger = logging.get_logger(__name__)
+CLVP_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "susnato/clvp_dev": "https://huggingface.co/susnato/clvp_dev/resolve/main/config.json",
+}
+class ClvpEncoderConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`ClvpEncoder`]. It is used to instantiate a CLVP
+    text or CLVP speech encoder according to the specified arguments. Instantiating a configuration with the defaults
+    will yield a similar configuration to that of the encoder of the CLVP
+    [susnato/clvp_dev](https://huggingface.co/susnato/clvp_dev) architecture.
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+    Args:
+        vocab_size (`int`, *optional*, defaults to 256):
+            Vocabulary size of the CLVP Encoder model.
+        hidden_size (`int`, *optional*, defaults to 768):
+            Dimensionality of the encoder layers and the pooler layer.
+        intermediate_size (`int`, *optional*, defaults to 1536):
+            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
+        projection_dim (`int`, *optional*, defaults to 768):
+            Dimensionality of the projection vector.
+        num_hidden_layers (`int`, *optional*, defaults to 20):
+            Number of hidden layers in the Transformer encoder.
+        num_attention_heads (`int`, *optional*, defaults to 12):
+            Number of attention heads for each attention layer in the Transformer encoder.
+        hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
+            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
+            `"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported.
+        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
+            The epsilon used by the layer normalization layers.
+        attention_dropout (`float`, *optional*, defaults to 0.1):
+            The dropout ratio for the attention probabilities.
+        dropout (`float`, *optional*, defaults to 0.1):
+            The dropout ratio for the feed-forward layers in [`ClvpEncoderMLP`].
+        use_rotary_embedding (`bool`, *optional*, defaults to `True`):
+            Whether to use rotary_embedding or not.
+        use_attention_bias (`bool`, *optional*, defaults to `False`):
+            Whether to use bias in Query, Key and Value layers during self attention.
+        summary_type (`str`, *optional*, defaults to `"mean"`):
+            What strategy to use to get pooler_output from the last_hidden_state. `"last"`, `"first"`, `"mean"` and
+            `"cls_index"` are supported.
+        initializer_factor (`float`, *optional*, defaults to 1.0):
+            A factor for initializing all weight matrices (should be kept to 1.0, used internally for initialization
+            testing).
+        bos_token_id (`int`, *optional*, defaults to 255):
+            Beginning of sequence token id.
+        eos_token_id (`int`, *optional*, defaults to 0):
+            End of sequence token id.
+    Example:
+    ```python
+    >>> from transformers import ClvpEncoderConfig, ClvpEncoder
+    >>> # Initializing a ClvpEncoderConfig with susnato/clvp_dev style configuration
+    >>> encoder_configuration = ClvpEncoderConfig()
+    >>> # Initializing a ClvpEncoder (with random weights) from the susnato/clvp_dev style configuration
+    >>> model = ClvpEncoder(encoder_configuration)
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+    model_type = "clvp_encoder"
+    def __init__(
+        self,
+        vocab_size=256,
+        hidden_size=768,
+        intermediate_size=1536,
+        projection_dim=768,
+        num_hidden_layers=20,
+        num_attention_heads=12,
+        hidden_act="gelu",
+        layer_norm_eps=1e-5,
+        attention_dropout=0.1,
+        dropout=0.1,
+        use_rotary_embedding=True,
+        use_attention_bias=False,
+        summary_type="mean",
+        initializer_factor=1.0,
+        bos_token_id=255,
+        eos_token_id=0,
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.projection_dim = projection_dim
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.layer_norm_eps = layer_norm_eps
+        self.hidden_act = hidden_act
+        self.initializer_factor = initializer_factor
+        self.attention_dropout = attention_dropout
+        self.dropout = dropout
+        self.use_rotary_embedding = use_rotary_embedding
+        self.use_attention_bias = use_attention_bias
+        self.summary_type = summary_type
+        self.bos_token_id = bos_token_id
+        self.eos_token_id = eos_token_id
+        super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
+    @classmethod
+    def from_pretrained(
+        cls, pretrained_model_name_or_path: Union[str, os.PathLike], config_type: str = "text_config", **kwargs
+    ) -> "PretrainedConfig":
+        cls._set_token_in_kwargs(kwargs)
+        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
+        # make sure to have the config_type be either "text_config" or "speech_config"
+        # this is to make sure that we can load only text or speech configs from the nested ClvpConfig.
+        if config_type not in ["text_config", "speech_config"]:
+            raise ValueError(
+                f"We can only load either 'text_config' or 'speech_config' but you are trying to load {config_type}"
+            )
+        # get the requested (text or speech) config dict if we are loading from ClvpConfig
+        if config_dict.get("model_type") == "clvp":
+            config_dict = config_dict[config_type]
+        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
+            logger.warning(
+                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
+                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
+            )
+        return cls.from_dict(config_dict, **kwargs)
+class ClvpDecoderConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`ClvpDecoder`]. It is used to instantiate a CLVP
+    Decoder Model according to the specified arguments, defining the model architecture. Instantiating a configuration
+    with the defaults will yield a similar configuration to that of the Decoder part of the CLVP
+    [susnato/clvp_dev](https://huggingface.co/susnato/clvp_dev) architecture.
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+    The architecture is similar to GPT2.
+    Args:
+        vocab_size (`int`, *optional*, defaults to 8194):
+            Vocabulary size of the model.
+        max_position_embeddings (`int`, *optional*, defaults to 608):
+            The maximum sequence length of mel tokens that this model might ever be used with. Similar to `n_positions`
+            in `GPT2Config`.
+        max_text_tokens (`int`, *optional*, defaults to 404):
+            The maximum sequence length of text tokens that this model might ever be used with. Similar to
+            `n_positions` in `GPT2Config`.
+        hidden_size (`int`, *optional*, defaults to 1024):
+            Dimensionality of the embeddings and hidden states.
+        num_hidden_layers (`int`, *optional*, defaults to 30):
+            Number of hidden layers in the Transformer decoder.
+        num_attention_heads (`int`, *optional*, defaults to 16):
+            Number of attention heads for each attention layer in the Transformer decoder.
+        n_inner (`int`, *optional*):
+            Dimensionality of the inner feed-forward layers. `None` will set it to 4 times `hidden_size`.
+        num_mel_attn_blocks (`int`, *optional*, defaults to 6):
+            Denotes the number of self attention layers in [`ClvpConditioningEncoder`].
+        activation_function (`str`, *optional*, defaults to `"gelu_new"`):
+            Activation function, to be selected in the list `["relu", "silu", "gelu", "tanh", "gelu_new"]`.
+        resid_pdrop (`float`, *optional*, defaults to 0.1):
+            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+        embd_pdrop (`float`, *optional*, defaults to 0.1):
+            The dropout ratio for the embeddings.
+        attention_dropout (`float`, *optional*, defaults to 0.1):
+            The dropout ratio for the attention.
+        layer_norm_epsilon (`float`, *optional*, defaults to 1e-05):
+            The epsilon to use in the layer normalization layers.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        summary_type (`string`, *optional*, defaults to `"cls_index"`):
+            Argument used when doing sequence summary.
+            Has to be one of the following options:
+                - `"last"`: Take the last token hidden state (like XLNet).
+                - `"first"`: Take the first token hidden state (like BERT).
+                - `"mean"`: Take the mean of all tokens hidden states.
+                - `"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
+                - `"attn"`: Not implemented now, use multi-head attention.
+        summary_use_proj (`bool`, *optional*, defaults to `True`):
+            Whether or not to add a projection after the vector extraction.
+        summary_activation (`str`, *optional*):
+            Pass `"tanh"` for a tanh activation to the output, any other value will result in no activation.
+        summary_proj_to_labels (`bool`, *optional*, defaults to `True`):
+            Whether the projection outputs should have `config.num_labels` or `config.hidden_size` classes.
+        summary_first_dropout (`float`, *optional*, defaults to 0.1):
+            The dropout ratio to be used after the projection and activation.
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether or not the model should return the last key/values attentions (not used by all models).
+        bos_token_id (`int`, *optional*, defaults to 8192):
+            Beginning of sequence token id, used at the start of the generation.
+        eos_token_id (`int`, *optional*, defaults to 8193):
+            End of sequence token id, used in the method
+            [`ClvpModelForConditionalGeneration.fix_speech_decoder_output()`] to correct decoder outputs.
+        feature_size (`int`, *optional*, defaults to 80):
+            The feature dimension of the extracted mel features. This value is used in [`ClvpConditioningEncoder`].
+        use_attention_bias (`bool`, *optional*, defaults to `True`):
+            Whether to use bias in Query, Key and Value layers during self attention.
+        initializer_factor (`float`, *optional*, defaults to 1.0):
+            A factor for initializing all weight matrices (should be kept to 1.0, used internally for initialization
+            testing).
+        decoder_fixing_codes (`list`, *optional*, defaults to `[83, 45, 45, 248]`):
+            These values are used in the method `fix_speech_decoder_output` to fix decoder generated outputs.
+    Example:
+    ```python
+    >>> from transformers import ClvpDecoderConfig, ClvpDecoder
+    >>> # Initializing a ClvpDecoderConfig with susnato/clvp_dev style configuration
+    >>> decoder_configuration = ClvpDecoderConfig()
+    >>> # Initializing a ClvpDecoder (with random weights) from the susnato/clvp_dev style configuration
+    >>> model = ClvpDecoder(decoder_configuration)
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+    model_type = "clvp_decoder"
+    def __init__(
+        self,
+        vocab_size=8194,
+        max_position_embeddings=608,
+        max_text_tokens=404,
+        hidden_size=1024,
+        num_hidden_layers=30,
+        num_attention_heads=16,
+        n_inner=None,
+        num_mel_attn_blocks=6,
+        activation_function="gelu_new",
+        resid_pdrop=0.1,
+        embd_pdrop=0.1,
+        attention_dropout=0.1,
+        layer_norm_epsilon=1e-5,
+        initializer_range=0.02,
+        summary_type="cls_index",
+        summary_use_proj=True,
+        summary_activation=None,
+        summary_proj_to_labels=True,
+        summary_first_dropout=0.1,
+        use_cache=True,
+        bos_token_id=8192,
+        eos_token_id=8193,
+        feature_size=80,
+        use_attention_bias=True,
+        initializer_factor=1.0,
+        decoder_fixing_codes=[83, 45, 45, 248],
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        self.max_position_embeddings = max_position_embeddings
+        self.max_text_tokens = max_text_tokens
+        self.hidden_size = hidden_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.n_inner = n_inner
+        self.num_mel_attn_blocks = num_mel_attn_blocks
+        self.activation_function = activation_function
+        self.resid_pdrop = resid_pdrop
+        self.embd_pdrop = embd_pdrop
+        self.attention_dropout = attention_dropout
+        self.layer_norm_epsilon = layer_norm_epsilon
+        self.initializer_range = initializer_range
+        self.summary_type = summary_type
+        self.summary_use_proj = summary_use_proj
+        self.summary_activation = summary_activation
+        self.summary_first_dropout = summary_first_dropout
+        self.summary_proj_to_labels = summary_proj_to_labels
+        self.use_cache = use_cache
+        self.feature_size = feature_size
+        self.use_attention_bias = use_attention_bias
+        self.initializer_factor = initializer_factor
+        self.decoder_fixing_codes = decoder_fixing_codes
+        self.bos_token_id = bos_token_id
+        self.eos_token_id = eos_token_id
+        super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
+        cls._set_token_in_kwargs(kwargs)
+        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
+        # get the decoder config dict if we are loading from ClvpConfig
+        if config_dict.get("model_type") == "clvp":
+            config_dict = config_dict["decoder_config"]
+        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
+            logger.warning(
+                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
+                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
+            )
+        return cls.from_dict(config_dict, **kwargs)
+class ClvpConfig(PretrainedConfig):
+    r"""
+    [`ClvpConfig`] is the configuration class to store the configuration of a [`ClvpModelForConditionalGeneration`]. It
+    is used to instantiate a CLVP model according to the specified arguments, defining the text model, speech model and
+    decoder model configs. Instantiating a configuration with the defaults will yield a similar configuration to that
+    of the CLVP [susnato/clvp_dev](https://huggingface.co/susnato/clvp_dev) architecture.
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+    Args:
+        text_config (`dict`, *optional*):
+            Dictionary of configuration options used to initialize the CLVP text encoder.
+        speech_config (`dict`, *optional*):
+            Dictionary of configuration options used to initialize CLVP speech encoder.
+        decoder_config (`dict`, *optional*):
+            Dictionary of configuration options used to initialize [`ClvpDecoderConfig`].
+        projection_dim (`int`, *optional*, defaults to 768):
+            Dimensionality of text and speech projection layers.
+        logit_scale_init_value (`float`, *optional*, defaults to 2.6592):
+            The initial value of the *logit_scale* parameter. Default is used as per the original CLVP implementation.
+        initializer_factor (`float`, *optional*, defaults to 1.0):
+            A factor for initializing all weight matrices (should be kept to 1.0, used internally for initialization
+            testing).
+        kwargs (*optional*):
+            Dictionary of keyword arguments.
+    Example:
+    ```python
+    >>> from transformers import ClvpConfig, ClvpModelForConditionalGeneration
+    >>> # Initializing a ClvpConfig with susnato/clvp_dev style configuration
+    >>> configuration = ClvpConfig()
+    >>> # Initializing a ClvpModelForConditionalGeneration (with random weights) from the susnato/clvp_dev style configuration
+    >>> model = ClvpModelForConditionalGeneration(configuration)
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    >>> # We can also initialize a ClvpConfig from two ClvpEncoderConfig objects (text and speech) and a ClvpDecoderConfig
+    >>> from transformers import ClvpEncoderConfig, ClvpDecoderConfig
+    >>> # Initializing a CLVP text, CLVP speech and CLVP decoder configuration
+    >>> config_text = ClvpEncoderConfig()
+    >>> config_speech = ClvpEncoderConfig()
+    >>> decoder_config = ClvpDecoderConfig()
+    >>> config = ClvpConfig.from_sub_model_configs(config_text, config_speech, decoder_config)
+    ```"""
+    model_type = "clvp"
+    is_composition = True
+    def __init__(
+        self,
+        text_config=None,
+        speech_config=None,
+        decoder_config=None,
+        projection_dim=768,
+        logit_scale_init_value=2.6592,
+        initializer_factor=1.0,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        if text_config is None:
+            text_config = {}
+            logger.info("`text_config` is `None`. Initializing the `ClvpEncoderConfig` with default values.")
+        if speech_config is None:
+            speech_config = {}
+            logger.info("`speech_config` is `None`. Initializing the `ClvpEncoderConfig` with default values.")
+        if decoder_config is None:
+            decoder_config = {}
+            logger.info("`decoder_config` is `None`. Initializing the `ClvpDecoderConfig` with default values.")
+        self.text_config = ClvpEncoderConfig(**text_config)
+        self.speech_config = ClvpEncoderConfig(**speech_config)
+        self.decoder_config = ClvpDecoderConfig(**decoder_config)
+        self.projection_dim = projection_dim
+        self.logit_scale_init_value = logit_scale_init_value
+        self.initializer_factor = initializer_factor
+    @classmethod
+    def from_sub_model_configs(
+        cls,
+        text_config: ClvpEncoderConfig,
+        speech_config: ClvpEncoderConfig,
+        decoder_config: ClvpDecoderConfig,
+        **kwargs,
+    ):
+        r"""
+        Instantiate a [`ClvpConfig`] (or a derived class) from CLVP text model configuration, CLVP speech model
+        configuration and CLVP decoder model configuration.
+        Args:
+            text_config (`ClvpEncoderConfig`):
+                Text model configuration of type [`ClvpEncoderConfig`].
+            speech_config (`ClvpEncoderConfig`):
+                Speech model configuration of type [`ClvpEncoderConfig`].
+            decoder_config (`ClvpDecoderConfig`):
+                Decoder model configuration of type [`ClvpDecoderConfig`].
+        Returns:
+            [`ClvpConfig`]: An instance of a configuration object
+        """
+        return cls(
+            text_config=text_config.to_dict(),
+            speech_config=speech_config.to_dict(),
+            decoder_config=decoder_config.to_dict(),
+            **kwargs,
+        )
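Note on the nested config handling above: `ClvpEncoderConfig.from_pretrained` accepts either a standalone encoder config or a full `ClvpConfig` dictionary, and in the latter case picks the sub-dictionary named by `config_type`. The sketch below mirrors only that selection step in plain Python. The helper `select_encoder_config` is hypothetical (it is not part of the PR), and the dictionary values are illustrative rather than taken from a real checkpoint.

```python
from typing import Any, Dict

VALID_ENCODER_CONFIG_TYPES = ("text_config", "speech_config")


def select_encoder_config(config_dict: Dict[str, Any], config_type: str = "text_config") -> Dict[str, Any]:
    """Hypothetical helper: return the encoder sub-config from a possibly nested dict."""
    if config_type not in VALID_ENCODER_CONFIG_TYPES:
        raise ValueError(
            f"We can only load either 'text_config' or 'speech_config' but you are trying to load {config_type}"
        )
    # A composite CLVP config nests the encoder configs under their own keys.
    if config_dict.get("model_type") == "clvp":
        return config_dict[config_type]
    # Otherwise the dict is assumed to already be a standalone encoder config.
    return config_dict


# Illustrative values only.
nested = {
    "model_type": "clvp",
    "text_config": {"model_type": "clvp_encoder", "vocab_size": 256},
    "speech_config": {"model_type": "clvp_encoder", "vocab_size": 8192},
    "decoder_config": {"model_type": "clvp_decoder"},
}
speech = select_encoder_config(nested, "speech_config")
```

Because the text and speech encoders share one config class, the `config_type` argument is the only thing that distinguishes which half of the composite config gets loaded.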
--- a/src/transformers/models/clvp/convert_clvp_to_hf.py
+++ b/src/transformers/models/clvp/convert_clvp_to_hf.py
+# coding=utf-8
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Weights conversion script for CLVP
+"""
+import argparse
+import os
+import torch
+from huggingface_hub import hf_hub_download
+from transformers import ClvpConfig, ClvpModelForConditionalGeneration
+_MODELS = {
+    "clvp": "https://huggingface.co/jbetker/tortoise-tts-v2/blob/main/.models/clvp2.pth",
+    "decoder": "https://huggingface.co/jbetker/tortoise-tts-v2/blob/main/.models/autoregressive.pth",
+}
+dim = 1024
+sub_dim = dim // 16
+CLVP_ENCODERS_MAPPING = {
+    "text_transformer.transformer.attn_layers": "text_encoder_model",
+    "speech_transformer.transformer.attn_layers": "speech_encoder_model",
+    "text_transformer.transformer.norm": "text_encoder_model.final_layer_norm",
+    "speech_transformer.transformer.norm": "speech_encoder_model.final_layer_norm",
+    "to_text_latent": "text_encoder_model.projection",
+    "to_speech_latent": "speech_encoder_model.projection",
+    "text_emb": "text_encoder_model.token_embedding",
+    "speech_emb": "speech_encoder_model.token_embedding",
+    "1.wrap.net.0": "mlp.fc1",
+    "1.wrap.net.3": "mlp.fc2",
+    "1.wrap": "self_attn",
+    "to_out": "out_proj",
+    "to_q": "q_proj",
+    "to_k": "k_proj",
+    "to_v": "v_proj",
+    "temperature": "logit_scale",
+}
+CLVP_DECODER_MAPPING = {
+    "conditioning_encoder.init": "conditioning_encoder.mel_conv",
+    "conditioning_encoder.attn": "conditioning_encoder.mel_attn_blocks",
+    "mel_attn_blocks": "group_norms",
+    ".norm.weight": ".weight",
+    ".norm.bias": ".bias",
+    "text_embedding": "conditioning_encoder.text_token_embedding",
+    "text_pos_embedding.emb": "conditioning_encoder.text_position_embedding",
+    "final_norm": "speech_decoder_model.final_norm",
+    "mel_head": "speech_decoder_model.lm_head",
+    "gpt.ln_f": "speech_decoder_model.model.decoder.layer_norm",
+    "mel_embedding": "speech_decoder_model.model.decoder.input_embeds_layer",
+    "mel_pos_embedding.emb": "speech_decoder_model.model.decoder.position_embeds_layer",
+    "gpt.h": "speech_decoder_model.model.decoder.layers",
+    "ln_1": "input_layernorm",
+    "ln_2": "post_attention_layernorm",
+}
+def update_index(present_index):
+    # every original layer pair (2i, 2i + 1) maps to the single converted layer i
+    return present_index // 2
+def convert_encoder_weights(original_weights):
+    converted_weights = {}
+    original_weights_keys = sorted(original_weights.keys())
+    for original_key in original_weights_keys:
+        updated_key = original_key
+        # for input_rmsnorm.weight and post_attention_rmsnorm.weight
+        if "0.0.g" in updated_key:
+            present_index = updated_key.split(".")[4]
+            if int(present_index) % 2 == 0:
+                updated_key = updated_key.replace("0.0.g", "input_rmsnorm.weight")
+            else:
+                updated_key = updated_key.replace("0.0.g", "post_attention_rmsnorm.weight")
+        if "transformer.attn_layers.layers" in updated_key:
+            present_index = updated_key.split(".")[4]
+            updated_index = update_index(int(present_index))
+            updated_key = updated_key.replace(
+                f"transformer.attn_layers.layers.{present_index}", f"transformer.attn_layers.layers.{updated_index}"
+            )
+        for k, v in CLVP_ENCODERS_MAPPING.items():
+            if k in updated_key:
+                updated_key = updated_key.replace(k, v)
+        converted_weights[updated_key] = original_weights.pop(original_key)
+    return converted_weights
+def convert_decoder_weights(original_weights):
+    converted_weights = {}
+    original_weights_keys = sorted(original_weights.keys())
+    for original_key in original_weights_keys:
+        updated_key = original_key
+        if len(updated_key.split(".")) > 3:
+            index, attr = updated_key.split(".")[2], updated_key.split(".")[-1]
+        # for decoder attention
+        if "attn.c_attn" in updated_key:
+            if attr == "weight":
+                slice1, slice2, slice3 = original_weights[updated_key].squeeze(-1).T.split(split_size=dim, dim=0)
+            else:
+                slice1, slice2, slice3 = original_weights[updated_key].split(split_size=dim, dim=0)
+            converted_weights[f"speech_decoder_model.model.decoder.layers.{index}.attn.q_proj.{attr}"] = slice1
+            converted_weights[f"speech_decoder_model.model.decoder.layers.{index}.attn.k_proj.{attr}"] = slice2
+            converted_weights[f"speech_decoder_model.model.decoder.layers.{index}.attn.v_proj.{attr}"] = slice3
+            continue
+        if "attn.c_proj" in updated_key:
+            converted_weights[f"speech_decoder_model.model.decoder.layers.{index}.attn.out_proj.{attr}"] = (
+                original_weights[updated_key].squeeze(-1).T
+            )
+            continue
+        # these entries are dropped and not copied to the converted checkpoint
+        if "attn.bias" in updated_key or "attn.masked_bias" in updated_key or "text_head" in updated_key:
+            original_weights.pop(updated_key)
+            continue
+        # conditioning encoder attention: split the fused qkv tensor and regroup it into q, k and v
+        if "qkv" in updated_key:
+            if attr == "weight":
+                slice1, slice2, slice3 = original_weights[updated_key].squeeze(-1).split(split_size=dim, dim=0)
+            else:
+                slice1, slice2, slice3 = original_weights[updated_key].split(split_size=dim, dim=0)
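+            # index1, index2 and index3 select every third block of `sub_dim` rows, starting at block 0, 1 and 2
+            indices = torch.arange(dim)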
+            index1, index2, index3 = (
+                indices.unfold(0, sub_dim, sub_dim * 3).flatten(),
+                indices[sub_dim:].unfold(0, sub_dim, sub_dim * 3).flatten(),
+                indices[2 * sub_dim :].unfold(0, sub_dim, sub_dim * 3).flatten(),
+            )
+            converted_weights[f"conditioning_encoder.mel_attn_blocks.{index}.q_proj.{attr}"] = torch.cat(
+                [slice1[index1], slice2[index3], slice3[index2]],
+                dim=0,
+            )
+            converted_weights[f"conditioning_encoder.mel_attn_blocks.{index}.k_proj.{attr}"] = torch.cat(
+                [slice1[index2], slice2[index1], slice3[index3]],
+                dim=0,
+            )
+            converted_weights[f"conditioning_encoder.mel_attn_blocks.{index}.v_proj.{attr}"] = torch.cat(
+                [slice1[index3], slice2[index2], slice3[index1]],
+                dim=0,
+            )
+            continue
+        if "proj_out" in updated_key:
+            converted_weights[f"conditioning_encoder.mel_attn_blocks.{index}.out_proj.{attr}"] = original_weights[
+                updated_key
+            ].squeeze(-1)
+            continue
+        for k, v in CLVP_DECODER_MAPPING.items():
+            if k in updated_key:
+                updated_key = updated_key.replace(k, v)
+        converted_weights[updated_key] = original_weights.pop(original_key)
+    return converted_weights
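+def _download(url: str, root: str):
+    """Downloads the file that `url` points to with `hf_hub_download`."""
+    # the repo id is built from URL parts 3 and 4, the filename from the last two parts
+    url_parts = url.split("/")
+    repo_id = f"{url_parts[3]}/{url_parts[4]}"
+    filename = f"{url_parts[-2]}/{url_parts[-1]}"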
+    hf_hub_download(
+        repo_id=repo_id,
+        filename=filename,
+        force_filename=root,
+        local_dir_use_symlinks=False,
+    )
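+def convert_clvp_weights(checkpoint_path, pytorch_dump_folder_path):
+    """
+    Loads the original checkpoints from `checkpoint_path` (downloading any that are missing), converts them and
+    saves the resulting model to `pytorch_dump_folder_path`.
+    """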
+    converted_checkpoint = {}
+    for each_model_name, each_model_url in _MODELS.items():
+        each_model_path = os.path.join(checkpoint_path, each_model_url.split("/")[-1])
+        if not os.path.exists(each_model_path):
+            print(f"\n{each_model_name} was not found! Downloading it to {each_model_path}")
+            _download(url=each_model_url, root=each_model_path)
+        if each_model_name == "clvp":
+            clvp_checkpoint = torch.load(each_model_path, map_location="cpu")
+        else:
+            decoder_checkpoint = torch.load(each_model_path, map_location="cpu")
+    # Converting the weights
+    converted_checkpoint.update(**convert_encoder_weights(clvp_checkpoint))
+    converted_checkpoint.update(**convert_decoder_weights(decoder_checkpoint))
+    config = ClvpConfig.from_pretrained("susnato/clvp_dev")
+    model = ClvpModelForConditionalGeneration(config)
+    # strict=True makes loading fail if any key is missing or unexpected
+    model.load_state_dict(converted_checkpoint, strict=True)
+    model.save_pretrained(pytorch_dump_folder_path)
+    print(f"Model saved at {pytorch_dump_folder_path}!")
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    # Required parameters
+    parser.add_argument(
+        "--checkpoint_path",
+        required=True,
+        type=str,
+        help="Path to the folder of downloaded checkpoints. (Please enter full path)",
+    )
+    parser.add_argument(
+        "--pytorch_dump_folder_path",
+        required=True,
+        type=str,
+        help="Path to the folder where the converted PyTorch model is saved. (Please enter full path)",
+    )
+    args = parser.parse_args()
+    convert_clvp_weights(args.checkpoint_path, args.pytorch_dump_folder_path)