Unverified Commit 7e9f10ac authored by Susnato Dhar, committed by GitHub

Add CLVP (#24745)

* init commit

* attention arch done except rotary emb

* rotary emb done

* text encoder working

* outputs matching

* arch first pass done

* make commands done, tests and docs remaining

* all tests passed, only docs remaining

* docs done

* doc-builder fix

* convert script removed (not relevant)

* minor comments done

* added ckpt conversion script

* tokenizer done

* very minor fix of index.md 2

* mostly make fixup related

* all done except fe and rotary emb

* very small change

* removed unidecode dependency

* style changes

* tokenizer removed require_backends

* added require_inflect to tokenizer tests

* removed VOCAB_FILES in tokenizer test

* inflect dependency removed

* added rotary pos emb cache and simplified the apply method

* style

* little doc change

* more comments

* feature extractor added

* added processor

* auto-regressive config added

* added CLVPConditioningEncoder

* comments done except the test one

* weights added successfully (NOT tested)

* tokenizer fix with numbers

* generate outputs matching

* almost all tests passing, integration tests not written

* Integ tests added

* major CUDA error fixed

* docs done

* rebase and multiple fixes

* fixed rebase overwrites

* generate code simplified and tests for AutoRegressive model added

* minor changes

* refactored gpt2 code in clvp file

* weights done and all code refactored

* mostly done except the fast_tokenizer

* doc test fix

* config file's doc fixes

* more config fix

* more comments

* tokenizer comments mostly done

* modeling file mostly refactored and can load modules

* ClvpEncoder tested

* ClvpDecoder, ClvpModel and ClvpForCausalLM tested

* integration and all tests passed

* more fixes

* docs almost done

* ckpt conversion refactored

* style and some failing tests fix

* comments

* temporary output fix but test_assisted_decoding_matches_greedy_search test fails

* majority changes done

* use_cache outputs same now! Along with the assisted_greedy_decoding test fix

* more comments

* more comments

* prepare_inputs_for_generation fixed and _prepare_model_inputs added

* style fix

* clvp.md change

* moved ClvpConditioningEncoder norms

* add model to new index

* added tokenizer input_ids_with_special_tokens

* small fix

* config mostly done

* added config-tester and changed conversion script

* more comments

* comments

* style fix

* some comments

* tokenizer changed back to prev state

* small comments

* added output hidden states for the main model

* style fix

* comments

* small change

* revert small change

* .

* Update clvp.md

* Update test_modeling_clvp.py

* :)

* some minor change

* new fixes

* remove to_dict from FE
parent 9dd58c53
@@ -321,6 +321,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CLVP](https://huggingface.co/docs/transformers/main/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve.
1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang.
...
@@ -296,6 +296,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=htt
1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CLVP](https://huggingface.co/docs/transformers/main/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve.
1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang.
...
@@ -270,6 +270,7 @@ conda install -c huggingface transformers
1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CLVP](https://huggingface.co/docs/transformers/main/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve.
1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang.
...
@@ -330,6 +330,7 @@ For how to install Flax, PyTorch, and TensorFlow with conda, see their
1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CLVP](https://huggingface.co/docs/transformers/main/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve.
1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang.
...
@@ -245,6 +245,7 @@ For installing these with conda, see the Flax, PyTorch, and TensorFlow installation pages
1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CLVP](https://huggingface.co/docs/transformers/main/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve.
1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang.
...
@@ -269,6 +269,7 @@ conda install -c huggingface transformers
1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CLVP](https://huggingface.co/docs/transformers/main/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve.
1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang.
...
@@ -281,6 +281,7 @@ conda install -c huggingface transformers
1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CLVP](https://huggingface.co/docs/transformers/main/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve.
1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang.
...
@@ -663,6 +663,8 @@
    title: CLIP
  - local: model_doc/clipseg
    title: CLIPSeg
  - local: model_doc/clvp
    title: CLVP
  - local: model_doc/data2vec
    title: Data2Vec
  - local: model_doc/deplot
...
@@ -92,6 +92,7 @@ Flax), PyTorch, and/or TensorFlow.
| [CLAP](model_doc/clap) | ✅ | ❌ | ❌ |
| [CLIP](model_doc/clip) | ✅ | ✅ | ✅ |
| [CLIPSeg](model_doc/clipseg) | ✅ | ❌ | ❌ |
| [CLVP](model_doc/clvp) | ✅ | ❌ | ❌ |
| [CodeGen](model_doc/codegen) | ✅ | ❌ | ❌ |
| [CodeLlama](model_doc/code_llama) | ✅ | ❌ | ❌ |
| [Conditional DETR](model_doc/conditional_detr) | ✅ | ❌ | ❌ |
...
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# CLVP
## Overview
The CLVP (Contrastive Language-Voice Pretrained Transformer) model was proposed in [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker.
The abstract from the paper is the following:
*In recent years, the field of image generation has been revolutionized by the application of autoregressive transformers and DDPMs. These approaches model the process of image generation as a step-wise probabilistic process and leverage large amounts of compute and data to learn the image distribution. This methodology of improving performance need not be confined to images. This paper describes a way to apply advances in the image generative domain to speech synthesis. The result is TorToise - an expressive, multi-voice text-to-speech system.*
This model was contributed by [Susnato Dhar](https://huggingface.co/susnato).
The original code can be found [here](https://github.com/neonbjb/tortoise-tts).
## Usage tips
1. CLVP is an integral part of the Tortoise TTS model.
2. CLVP can be used to compare different generated speech candidates with the provided text, and the best speech tokens are forwarded to the diffusion model.
3. The use of the [`ClvpModelForConditionalGeneration.generate()`] method is strongly recommended for Tortoise usage.
4. Note that the CLVP model expects the audio to be sampled at 22.05 kHz, unlike most other audio models, which expect 16 kHz; see the resampling sketch below.
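Since most public speech datasets ship at 16 kHz, resampling is usually needed first. A minimal sketch using `torchaudio` (an assumed dependency here, not required by CLVP itself; the full example further below resamples with `datasets.Audio` instead):

```python
import torchaudio
import torchaudio.transforms as T

# "speech.wav" is a placeholder path for any waveform to be fed to CLVP
waveform, sampling_rate = torchaudio.load("speech.wav")

# CLVP expects 22.05 kHz input, so resample anything else
if sampling_rate != 22050:
    waveform = T.Resample(orig_freq=sampling_rate, new_freq=22050)(waveform)
```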
## Brief Explanation
- The [`ClvpTokenizer`] tokenizes the text input, and the [`ClvpFeatureExtractor`] extracts the log mel-spectrogram from the desired audio.
- [`ClvpConditioningEncoder`] takes those text tokens and audio representations and converts them into embeddings conditioned on the text and audio.
- The [`ClvpForCausalLM`] uses those embeddings to generate multiple speech candidates.
- Each speech candidate is passed through the speech encoder ([`ClvpEncoder`]), which converts it into a vector representation, while the text encoder ([`ClvpEncoder`]) converts the text tokens into the same latent space.
- At the end, each speech vector is compared with the text vector to find the speech candidate that best matches the text.
- [`ClvpModelForConditionalGeneration.generate()`] compresses all of the logic described above into a single method.
Example:
```python
>>> import datasets
>>> from transformers import ClvpProcessor, ClvpModelForConditionalGeneration
>>> # Define the text and load the audio (we take an audio example from the Hugging Face Hub using the `datasets` library).
>>> text = "This is an example text."
>>> ds = datasets.load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.cast_column("audio", datasets.Audio(sampling_rate=22050))
>>> sample = ds[0]["audio"]
>>> # Define processor and model.
>>> processor = ClvpProcessor.from_pretrained("susnato/clvp_dev")
>>> model = ClvpModelForConditionalGeneration.from_pretrained("susnato/clvp_dev")
>>> # Generate processor output and model output.
>>> processor_output = processor(raw_speech=sample["array"], sampling_rate=sample["sampling_rate"], text=text, return_tensors="pt")
>>> generated_output = model.generate(**processor_output)
```
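The contrastive matching step can also be inspected directly. A minimal sketch continuing the example above, assuming the processor output exposes `input_ids` and `input_features` and using the documented `get_text_features`/`get_speech_features` helpers:

```python
>>> import torch

>>> # project the text tokens into the shared latent space
>>> text_embeds = model.get_text_features(input_ids=processor_output["input_ids"])
>>> # generate speech candidates internally and project them into the same space
>>> speech_embeds = model.get_speech_features(
...     input_ids=processor_output["input_ids"], input_features=processor_output["input_features"]
... )
>>> # a higher cosine similarity means the speech candidate matches the text better
>>> similarity = torch.nn.functional.cosine_similarity(text_embeds, speech_embeds, dim=-1)
```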
## ClvpConfig
[[autodoc]] ClvpConfig
- from_sub_model_configs
## ClvpEncoderConfig
[[autodoc]] ClvpEncoderConfig
## ClvpDecoderConfig
[[autodoc]] ClvpDecoderConfig
## ClvpTokenizer
[[autodoc]] ClvpTokenizer
- save_vocabulary
## ClvpFeatureExtractor
[[autodoc]] ClvpFeatureExtractor
- __call__
## ClvpProcessor
[[autodoc]] ClvpProcessor
- __call__
- decode
- batch_decode
## ClvpModelForConditionalGeneration
[[autodoc]] ClvpModelForConditionalGeneration
- forward
- generate
- get_text_features
- get_speech_features
## ClvpForCausalLM
[[autodoc]] ClvpForCausalLM
## ClvpModel
[[autodoc]] ClvpModel
## ClvpEncoder
[[autodoc]] ClvpEncoder
## ClvpDecoder
[[autodoc]] ClvpDecoder
@@ -256,6 +256,15 @@ _import_structure = {
"CLIPSegTextConfig",
"CLIPSegVisionConfig",
],
"models.clvp": [
"CLVP_PRETRAINED_CONFIG_ARCHIVE_MAP",
"ClvpConfig",
"ClvpDecoderConfig",
"ClvpEncoderConfig",
"ClvpFeatureExtractor",
"ClvpProcessor",
"ClvpTokenizer",
],
"models.code_llama": [], "models.code_llama": [],
"models.codegen": ["CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP", "CodeGenConfig", "CodeGenTokenizer"], "models.codegen": ["CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP", "CodeGenConfig", "CodeGenTokenizer"],
"models.conditional_detr": ["CONDITIONAL_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP", "ConditionalDetrConfig"], "models.conditional_detr": ["CONDITIONAL_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP", "ConditionalDetrConfig"],
...@@ -1458,6 +1467,17 @@ else: ...@@ -1458,6 +1467,17 @@ else:
"CLIPSegVisionModel", "CLIPSegVisionModel",
] ]
) )
_import_structure["models.clvp"].extend(
[
"CLVP_PRETRAINED_MODEL_ARCHIVE_LIST",
"ClvpDecoder",
"ClvpEncoder",
"ClvpForCausalLM",
"ClvpModel",
"ClvpModelForConditionalGeneration",
"ClvpPreTrainedModel",
]
)
_import_structure["models.codegen"].extend( _import_structure["models.codegen"].extend(
[ [
"CODEGEN_PRETRAINED_MODEL_ARCHIVE_LIST", "CODEGEN_PRETRAINED_MODEL_ARCHIVE_LIST",
...@@ -4446,6 +4466,15 @@ if TYPE_CHECKING: ...@@ -4446,6 +4466,15 @@ if TYPE_CHECKING:
CLIPSegTextConfig, CLIPSegTextConfig,
CLIPSegVisionConfig, CLIPSegVisionConfig,
) )
from .models.clvp import (
CLVP_PRETRAINED_CONFIG_ARCHIVE_MAP,
ClvpConfig,
ClvpDecoderConfig,
ClvpEncoderConfig,
ClvpFeatureExtractor,
ClvpProcessor,
ClvpTokenizer,
)
from .models.codegen import CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP, CodeGenConfig, CodeGenTokenizer
from .models.conditional_detr import CONDITIONAL_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP, ConditionalDetrConfig
from .models.convbert import CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, ConvBertConfig, ConvBertTokenizer
@@ -5516,6 +5545,15 @@ if TYPE_CHECKING:
CLIPSegTextModel,
CLIPSegVisionModel,
)
from .models.clvp import (
CLVP_PRETRAINED_MODEL_ARCHIVE_LIST,
ClvpDecoder,
ClvpEncoder,
ClvpForCausalLM,
ClvpModel,
ClvpModelForConditionalGeneration,
ClvpPreTrainedModel,
)
from .models.codegen import (
CODEGEN_PRETRAINED_MODEL_ARCHIVE_LIST,
CodeGenForCausalLM,
...
@@ -46,6 +46,7 @@ from . import (
clap,
clip,
clipseg,
clvp,
code_llama,
codegen,
conditional_detr,
...
@@ -57,6 +57,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("clap", "ClapConfig"),
("clip", "CLIPConfig"),
("clipseg", "CLIPSegConfig"),
("clvp", "ClvpConfig"),
("code_llama", "LlamaConfig"), ("code_llama", "LlamaConfig"),
("codegen", "CodeGenConfig"), ("codegen", "CodeGenConfig"),
("conditional_detr", "ConditionalDetrConfig"), ("conditional_detr", "ConditionalDetrConfig"),
...@@ -276,6 +277,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict( ...@@ -276,6 +277,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("clap", "CLAP_PRETRAINED_MODEL_ARCHIVE_LIST"), ("clap", "CLAP_PRETRAINED_MODEL_ARCHIVE_LIST"),
("clip", "CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("clip", "CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("clipseg", "CLIPSEG_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("clipseg", "CLIPSEG_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("clvp", "CLVP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("codegen", "CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("codegen", "CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("conditional_detr", "CONDITIONAL_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("conditional_detr", "CONDITIONAL_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("convbert", "CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("convbert", "CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
...@@ -481,6 +483,7 @@ MODEL_NAMES_MAPPING = OrderedDict( ...@@ -481,6 +483,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("clap", "CLAP"), ("clap", "CLAP"),
("clip", "CLIP"), ("clip", "CLIP"),
("clipseg", "CLIPSeg"), ("clipseg", "CLIPSeg"),
("clvp", "CLVP"),
("code_llama", "CodeLlama"), ("code_llama", "CodeLlama"),
("codegen", "CodeGen"), ("codegen", "CodeGen"),
("conditional_detr", "Conditional DETR"), ("conditional_detr", "Conditional DETR"),
......
...@@ -44,6 +44,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict( ...@@ -44,6 +44,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict(
("clap", "ClapFeatureExtractor"), ("clap", "ClapFeatureExtractor"),
("clip", "CLIPFeatureExtractor"), ("clip", "CLIPFeatureExtractor"),
("clipseg", "ViTFeatureExtractor"), ("clipseg", "ViTFeatureExtractor"),
("clvp", "ClvpFeatureExtractor"),
("conditional_detr", "ConditionalDetrFeatureExtractor"), ("conditional_detr", "ConditionalDetrFeatureExtractor"),
("convnext", "ConvNextFeatureExtractor"), ("convnext", "ConvNextFeatureExtractor"),
("cvt", "ConvNextFeatureExtractor"), ("cvt", "ConvNextFeatureExtractor"),
......
...@@ -55,6 +55,7 @@ MODEL_MAPPING_NAMES = OrderedDict( ...@@ -55,6 +55,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("clap", "ClapModel"), ("clap", "ClapModel"),
("clip", "CLIPModel"), ("clip", "CLIPModel"),
("clipseg", "CLIPSegModel"), ("clipseg", "CLIPSegModel"),
("clvp", "ClvpModelForConditionalGeneration"),
("code_llama", "LlamaModel"), ("code_llama", "LlamaModel"),
("codegen", "CodeGenModel"), ("codegen", "CodeGenModel"),
("conditional_detr", "ConditionalDetrModel"), ("conditional_detr", "ConditionalDetrModel"),
......
...@@ -53,6 +53,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict( ...@@ -53,6 +53,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
("clap", "ClapProcessor"), ("clap", "ClapProcessor"),
("clip", "CLIPProcessor"), ("clip", "CLIPProcessor"),
("clipseg", "CLIPSegProcessor"), ("clipseg", "CLIPSegProcessor"),
("clvp", "ClvpProcessor"),
("flava", "FlavaProcessor"), ("flava", "FlavaProcessor"),
("fuyu", "FuyuProcessor"), ("fuyu", "FuyuProcessor"),
("git", "GitProcessor"), ("git", "GitProcessor"),
......
...@@ -121,6 +121,7 @@ else: ...@@ -121,6 +121,7 @@ else:
"CLIPTokenizerFast" if is_tokenizers_available() else None, "CLIPTokenizerFast" if is_tokenizers_available() else None,
), ),
), ),
("clvp", ("ClvpTokenizer", None)),
(
"code_llama",
(
...
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import (
OptionalDependencyNotAvailable,
_LazyModule,
is_torch_available,
)
_import_structure = {
"configuration_clvp": [
"CLVP_PRETRAINED_CONFIG_ARCHIVE_MAP",
"ClvpConfig",
"ClvpDecoderConfig",
"ClvpEncoderConfig",
],
"feature_extraction_clvp": ["ClvpFeatureExtractor"],
"processing_clvp": ["ClvpProcessor"],
"tokenization_clvp": ["ClvpTokenizer"],
}
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_clvp"] = [
"CLVP_PRETRAINED_MODEL_ARCHIVE_LIST",
"ClvpModelForConditionalGeneration",
"ClvpForCausalLM",
"ClvpModel",
"ClvpPreTrainedModel",
"ClvpEncoder",
"ClvpDecoder",
]
if TYPE_CHECKING:
from .configuration_clvp import (
CLVP_PRETRAINED_CONFIG_ARCHIVE_MAP,
ClvpConfig,
ClvpDecoderConfig,
ClvpEncoderConfig,
)
from .feature_extraction_clvp import ClvpFeatureExtractor
from .processing_clvp import ClvpProcessor
from .tokenization_clvp import ClvpTokenizer
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_clvp import (
CLVP_PRETRAINED_MODEL_ARCHIVE_LIST,
ClvpDecoder,
ClvpEncoder,
ClvpForCausalLM,
ClvpModel,
ClvpModelForConditionalGeneration,
ClvpPreTrainedModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
# coding=utf-8
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Weights conversion script for CLVP
"""
import argparse
import os
import torch
from huggingface_hub import hf_hub_download
from transformers import ClvpConfig, ClvpModelForConditionalGeneration
_MODELS = {
"clvp": "https://huggingface.co/jbetker/tortoise-tts-v2/blob/main/.models/clvp2.pth",
"decoder": "https://huggingface.co/jbetker/tortoise-tts-v2/blob/main/.models/autoregressive.pth",
}
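# `dim` is the hidden size of the checkpoints being converted; `sub_dim` (dim / 16)
# is the chunk width used later to de-interleave the fused qkv projection weights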
dim = 1024
sub_dim = dim // 16
CLVP_ENCODERS_MAPPING = {
"text_transformer.transformer.attn_layers": "text_encoder_model",
"speech_transformer.transformer.attn_layers": "speech_encoder_model",
"text_transformer.transformer.norm": "text_encoder_model.final_layer_norm",
"speech_transformer.transformer.norm": "speech_encoder_model.final_layer_norm",
"to_text_latent": "text_encoder_model.projection",
"to_speech_latent": "speech_encoder_model.projection",
"text_emb": "text_encoder_model.token_embedding",
"speech_emb": "speech_encoder_model.token_embedding",
"1.wrap.net.0": "mlp.fc1",
"1.wrap.net.3": "mlp.fc2",
"1.wrap": "self_attn",
"to_out": "out_proj",
"to_q": "q_proj",
"to_k": "k_proj",
"to_v": "v_proj",
"temperature": "logit_scale",
}
CLVP_DECODER_MAPPING = {
"conditioning_encoder.init": "conditioning_encoder.mel_conv",
"conditioning_encoder.attn": "conditioning_encoder.mel_attn_blocks",
"mel_attn_blocks": "group_norms",
".norm.weight": ".weight",
".norm.bias": ".bias",
"text_embedding": "conditioning_encoder.text_token_embedding",
"text_pos_embedding.emb": "conditioning_encoder.text_position_embedding",
"final_norm": "speech_decoder_model.final_norm",
"mel_head": "speech_decoder_model.lm_head",
"gpt.ln_f": "speech_decoder_model.model.decoder.layer_norm",
"mel_embedding": "speech_decoder_model.model.decoder.input_embeds_layer",
"mel_pos_embedding.emb": "speech_decoder_model.model.decoder.position_embeds_layer",
"gpt.h": "speech_decoder_model.model.decoder.layers",
"ln_1": "input_layernorm",
"ln_2": "post_attention_layernorm",
}
def update_index(present_index):
    # the original encoder stack alternates attention and feed-forward blocks as
    # separate layers, so each consecutive pair maps to one converted layer index
    return present_index // 2
def convert_encoder_weights(original_weights):
converted_weights = {}
original_weights_keys = sorted(original_weights.keys())
for original_key in original_weights_keys:
updated_key = original_key
# for input_rmsnorm.weight and post_attention_rmsnorm.weight
if "0.0.g" in updated_key:
present_index = updated_key.split(".")[4]
if int(present_index) % 2 == 0:
updated_key = updated_key.replace("0.0.g", "input_rmsnorm.weight")
else:
updated_key = updated_key.replace("0.0.g", "post_attention_rmsnorm.weight")
if "transformer.attn_layers.layers" in updated_key:
present_index = updated_key.split(".")[4]
updated_index = update_index(int(present_index))
updated_key = updated_key.replace(
f"transformer.attn_layers.layers.{present_index}", f"transformer.attn_layers.layers.{updated_index}"
)
for k, v in CLVP_ENCODERS_MAPPING.items():
if k in updated_key:
updated_key = updated_key.replace(k, v)
converted_weights[updated_key] = original_weights.pop(original_key)
return converted_weights
def convert_decoder_weights(original_weights):
converted_weights = {}
original_weights_keys = sorted(original_weights.keys())
for original_key in original_weights_keys:
updated_key = original_key
if len(updated_key.split(".")) > 3:
index, attr = updated_key.split(".")[2], updated_key.split(".")[-1]
# for decoder attention
if "attn.c_attn" in updated_key:
if attr == "weight":
slice1, slice2, slice3 = original_weights[updated_key].squeeze(-1).T.split(split_size=dim, dim=0)
else:
slice1, slice2, slice3 = original_weights[updated_key].split(split_size=dim, dim=0)
converted_weights[f"speech_decoder_model.model.decoder.layers.{index}.attn.q_proj.{attr}"] = slice1
converted_weights[f"speech_decoder_model.model.decoder.layers.{index}.attn.k_proj.{attr}"] = slice2
converted_weights[f"speech_decoder_model.model.decoder.layers.{index}.attn.v_proj.{attr}"] = slice3
continue
if "attn.c_proj" in updated_key:
converted_weights[f"speech_decoder_model.model.decoder.layers.{index}.attn.out_proj.{attr}"] = (
original_weights[updated_key].squeeze(-1).T
)
continue
if "attn.bias" in updated_key or "attn.masked_bias" in updated_key or "text_head" in updated_key:
original_weights.pop(updated_key)
continue
# conditioning encoder attention: the original checkpoint stores a fused qkv
# projection whose heads are interleaved; it is split back into q/k/v below
if "qkv" in updated_key:
if attr == "weight":
slice1, slice2, slice3 = original_weights[updated_key].squeeze(-1).split(split_size=dim, dim=0)
else:
slice1, slice2, slice3 = original_weights[updated_key].split(split_size=dim, dim=0)
indices = torch.arange(dim)
index1, index2, index3 = (
indices.unfold(0, sub_dim, sub_dim * 3).flatten(),
indices[sub_dim:].unfold(0, sub_dim, sub_dim * 3).flatten(),
indices[2 * sub_dim :].unfold(0, sub_dim, sub_dim * 3).flatten(),
)
converted_weights[f"conditioning_encoder.mel_attn_blocks.{index}.q_proj.{attr}"] = torch.concatenate(
[slice1[index1], slice2[index3], slice3[index2]],
axis=0,
)
converted_weights[f"conditioning_encoder.mel_attn_blocks.{index}.k_proj.{attr}"] = torch.concatenate(
[slice1[index2], slice2[index1], slice3[index3]],
axis=0,
)
converted_weights[f"conditioning_encoder.mel_attn_blocks.{index}.v_proj.{attr}"] = torch.concatenate(
[slice1[index3], slice2[index2], slice3[index1]],
axis=0,
)
continue
if "proj_out" in updated_key:
converted_weights[f"conditioning_encoder.mel_attn_blocks.{index}.out_proj.{attr}"] = original_weights[
updated_key
].squeeze(-1)
continue
for k, v in CLVP_DECODER_MAPPING.items():
if k in updated_key:
updated_key = updated_key.replace(k, v)
converted_weights[updated_key] = original_weights.pop(original_key)
return converted_weights
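# A toy view of the de-interleaving above (added for illustration): with dim = 6 and
# sub_dim = 1, torch.arange(6).unfold(0, 1, 3).flatten() gives tensor([0, 3]), i.e.
# every third sub_dim-wide chunk, which undoes the head-interleaved qkv packing.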
def _download(url: str, root: str):
    # hub URLs look like https://huggingface.co/<user>/<repo>/blob/main/<folder>/<file>;
    # recover the repo id and the in-repo file path from the URL segments
    repo_id = f"{url.split('/')[3]}/{url.split('/')[4]}"
    filename = f"{url.split('/')[-2]}/{url.split('/')[-1]}"
hf_hub_download(
repo_id=repo_id,
filename=filename,
force_filename=root,
local_dir_use_symlinks=False,
)
def convert_clvp_weights(checkpoint_path, pytorch_dump_folder_path):
converted_checkpoint = {}
for each_model_name, each_model_url in _MODELS.items():
each_model_path = os.path.join(checkpoint_path, each_model_url.split("/")[-1])
if not os.path.exists(each_model_path):
print(f"\n{each_model_name} was not found! Downloading it to {each_model_path}")
_download(url=each_model_url, root=each_model_path)
if each_model_name == "clvp":
clvp_checkpoint = torch.load(each_model_path, map_location="cpu")
else:
decoder_checkpoint = torch.load(each_model_path, map_location="cpu")
# Converting the weights
converted_checkpoint.update(**convert_encoder_weights(clvp_checkpoint))
converted_checkpoint.update(**convert_decoder_weights(decoder_checkpoint))
config = ClvpConfig.from_pretrained("susnato/clvp_dev")
model = ClvpModelForConditionalGeneration(config)
model.load_state_dict(converted_checkpoint, strict=True)
model.save_pretrained(pytorch_dump_folder_path)
print(f"Model saved at {pytorch_dump_folder_path}!")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--checkpoint_path", type=str, help="Path to the folder of downloaded checkpoints. (Please enter full path)"
)
parser.add_argument(
"--pytorch_dump_folder_path",
default=None,
type=str,
help="Path to the output PyTorch model. (Please enter full path)",
)
args = parser.parse_args()
convert_clvp_weights(args.checkpoint_path, args.pytorch_dump_folder_path)
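# Hypothetical invocation (the script's filename is not part of this diff; the two
# flags are the ones defined above):
#   python convert_clvp_to_hf.py --checkpoint_path ./original_ckpts --pytorch_dump_folder_path ./clvp_hf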