[CLAP] Add CLAP to the library (#21370)

* add model like clip * update * text model ok * clap text works * some refactor - `CLAPVision` to `CLAPAudio` - refactor kwargs of audio modules * more refactor * more refactor * more refactor * correct fusion * more refactor * new modules * add basic processor * fixup * remove whisper copioed from * audio logits match * add doc * correct filters mel and add maxlength * style * few fixes * forward passes * fixup * fixup * some clean up * remove mels form the dictionnary * pad after the repeat * update padding when dsmaller * fix padding * style * use swin patch merging * use copied from swin * processor with any tokenizer * more copied from * some clean up * more refactor * fix mel when rand_trunc * style * remove unused imports * update processing * remove image processing tests * add testing fiel * fixmodeling issues * replace with `is_longer` * clap in serialization * more refactor * `make fixup` * make fixup * fix feature extractor * update test feature extractor * `make fixup` * clean up config * more clean up * more cleanup * update tests * refactor tests and inits * removeCLAP vision config * remove CLAP from image procssing auto and dummy vision objects * update inits * style * re order classes in modeling clap * Use roberta tokenizer as the other weights are not open sourced * small cleaup * remove tokenization CLAP * processor tokenizr is roberta * update feature extraction doc * remove vclap from model zero shot * update f_min and f_max to frequency_xx * some changes - fix modeling keys - add `is_longer` in the forward pass - make fixup * make fixup * consistent behavior ebtween rand_crop and fusion * add numpy resize and bilinear and documentation * move resizing to image utils * clean feature extraction * import resize from correct file * resize in image transforms * update * style * style * nit * remove unused arguments form the feature extractor * style * few fixes + make fixup * oops * fix more tests * add zero shot audio classification pipeline * update zeroshot classification pipeline * fixup * fix copies * all CI tests pass * make fixup + fix docs * fix docs * fix docs * update tests pip;eline * update zero shot pipeline * update feature extraction clap * update tokenization auto * use nested simplify * update pipeline tests * Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * split in two lines * fixes * refactor * clean up * add integration tests * update config docstring * style * update processor * fix processor test * fix feat extractor tests * update docs * Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * fix readmes * fix tips * Update src/transformers/models/auto/configuration_auto.py * update doc and remove todo -> properly explained * fix idx and typo * typoe * cleanup config * cleanup tests, styles and doc * ignore docstyle on image transform * add conversion script * remove the `clap` indx in favor of `CLAP` * update __init * nits * Update src/transformers/pipelines/__init__.py * fix bug * clarifiy config * fix copy * fix init * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * fix model output * fix comment * make fixup * make fixup * rename to `Clap` * replace to `Clap` * replace to `Clap` * repo consistency * again repo-consistency * make fixup * Apply suggestions from code review Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * add config * changes * update conversion * Apply suggestions from code review Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * remove unused function * update based on code reviews * style * more comments * cleanup * clean up * style * apply suggestions * Empty commit * pipeline will be added in a different PR * update calls to audio utils functions * update pipeline init * style * style * styling again * use pad * fix repo-consistency * update utils and add doc for audio utils * clean up resize by using torch. update inits accordingly * style * CLap's tokenizer is RobertA * add audio utils to internal toctreee * update totctree * style * update documentation and normalize naming accross audio utils and feature extraction clap * style * clean up * update doc and typos * fix doctest * update modelin code, got rid of a lot of reshaping * style on added doc audio utils * update modeling clap * style * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * docstringvariables with CLAP * rename key * update modeling CLAP * update audio utils docstring * update processing clap * fix readmes * fix toctree * udpate configuration clap * fix init * make fixup * fix * fix * update naming * update * update checkpoint path * Apply suggestions from code review * Major refactoring * Update src/transformers/models/clap/configuration_clap.py * merge --------- Co-authored-by: younesbelkada <younesbelkada@gmail.com> Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

[CLAP] Add CLAP to the library (#21370)
* add model like clip * update * text model ok * clap text works * some refactor - `CLAPVision` to `CLAPAudio` - refactor kwargs of audio modules * more refactor * more refactor * more refactor * correct fusion * more refactor * new modules * add basic processor * fixup * remove whisper copioed from * audio logits match * add doc * correct filters mel and add maxlength * style * few fixes * forward passes * fixup * fixup * some clean up * remove mels form the dictionnary * pad after the repeat * update padding when dsmaller * fix padding * style * use swin patch merging * use copied from swin * processor with any tokenizer * more copied from * some clean up * more refactor * fix mel when rand_trunc * style * remove unused imports * update processing * remove image processing tests * add testing fiel * fixmodeling issues * replace with `is_longer` * clap in serialization * more refactor * `make fixup` * make fixup * fix feature extractor * update test feature extractor * `make fixup` * clean up config * more clean up * more cleanup * update tests * refactor tests and inits * removeCLAP vision config * remove CLAP from image procssing auto and dummy vision objects * update inits * style * re order classes in modeling clap * Use roberta tokenizer as the other weights are not open sourced * small cleaup * remove tokenization CLAP * processor tokenizr is roberta * update feature extraction doc * remove vclap from model zero shot * update f_min and f_max to frequency_xx * some changes - fix modeling keys - add `is_longer` in the forward pass - make fixup * make fixup * consistent behavior ebtween rand_crop and fusion * add numpy resize and bilinear and documentation * move resizing to image utils * clean feature extraction * import resize from correct file * resize in image transforms * update * style * style * nit * remove unused arguments form the feature extractor * style * few fixes + make fixup * oops * fix more tests * add zero shot audio classification pipeline * update zeroshot classification pipeline * fixup * fix copies * all CI tests pass * make fixup + fix docs * fix docs * fix docs * update tests pip;eline * update zero shot pipeline * update feature extraction clap * update tokenization auto * use nested simplify * update pipeline tests * Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * split in two lines * fixes * refactor * clean up * add integration tests * update config docstring * style * update processor * fix processor test * fix feat extractor tests * update docs * Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * fix readmes * fix tips * Update src/transformers/models/auto/configuration_auto.py * update doc and remove todo -> properly explained * fix idx and typo * typoe * cleanup config * cleanup tests, styles and doc * ignore docstyle on image transform * add conversion script * remove the `clap` indx in favor of `CLAP` * update __init * nits * Update src/transformers/pipelines/__init__.py * fix bug * clarifiy config * fix copy * fix init * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * fix model output * fix comment * make fixup * make fixup * rename to `Clap` * replace to `Clap` * replace to `Clap` * repo consistency * again repo-consistency * make fixup * Apply suggestions from code review Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * add config * changes * update conversion * Apply suggestions from code review Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * remove unused function * update based on code reviews * style * more comments * cleanup * clean up * style * apply suggestions * Empty commit * pipeline will be added in a different PR * update calls to audio utils functions * update pipeline init * style * style * styling again * use pad * fix repo-consistency * update utils and add doc for audio utils * clean up resize by using torch. update inits accordingly * style * CLap's tokenizer is RobertA * add audio utils to internal toctreee * update totctree * style * update documentation and normalize naming accross audio utils and feature extraction clap * style * clean up * update doc and typos * fix doctest * update modelin code, got rid of a lot of reshaping * style on added doc audio utils * update modeling clap * style * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * docstringvariables with CLAP * rename key * update modeling CLAP * update audio utils docstring * update processing clap * fix readmes * fix toctree * udpate configuration clap * fix init * make fixup * fix * fix * update naming * update * update checkpoint path * Apply suggestions from code review * Major refactoring * Update src/transformers/models/clap/configuration_clap.py * merge --------- Co-authored-by: younesbelkada <younesbelkada@gmail.com> Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
c236a621 · Arthur · GitHub · 6b0257de · c236a621 · c236a621
Unverified Commit c236a621 authored Feb 16, 2023 by Arthur Committed by GitHub Feb 16, 2023
20 changed files
--- a/README.md
+++ b/README.md
@@ -295,6 +295,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
 1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
 1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
 1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
+1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation]https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.

--- a/README_es.md
+++ b/README_es.md
@@ -288,6 +288,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
 1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
 1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
 1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
+1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation]https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.

--- a/README_hd.md
+++ b/README_hd.md
@@ -260,6 +260,7 @@ conda install -c huggingface transformers
 1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (इनरिया/फेसबुक/सोरबोन से) साथ में कागज [CamemBERT: एक टेस्टी फ्रेंच लैंग्वेज मॉडल](https:// arxiv.org/abs/1911.03894) लुई मार्टिन*, बेंजामिन मुलर*, पेड्रो जेवियर ऑर्टिज़ सुआरेज़*, योआन ड्यूपॉन्ट, लॉरेंट रोमरी, एरिक विलेमोन्टे डे ला क्लर्जरी, जैमे सेडाह और बेनोइट सगोट द्वारा।
 1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (Google रिसर्च से) साथ में दिया गया पेपर [कैनाइन: प्री-ट्रेनिंग ए एफिशिएंट टोकनाइजेशन-फ्री एनकोडर फॉर लैंग्वेज रिप्रेजेंटेशन]( https://arxiv.org/abs/2103.06874) जोनाथन एच क्लार्क, डैन गैरेट, यूलिया टर्क, जॉन विएटिंग द्वारा।
 1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
+1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (LAION-AI से) Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. द्वाराअनुसंधान पत्र [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation]https://arxiv.org/abs/2211.06687) के साथ जारी किया गया
 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (OpenAI से) साथ वाला पेपर [लर्निंग ट्रांसफरेबल विजुअल मॉडल फ्रॉम नेचुरल लैंग्वेज सुपरविजन](https://arxiv.org /abs/2103.00020) एलेक रैडफोर्ड, जोंग वूक किम, क्रिस हैलासी, आदित्य रमेश, गेब्रियल गोह, संध्या अग्रवाल, गिरीश शास्त्री, अमांडा एस्केल, पामेला मिश्किन, जैक क्लार्क, ग्रेचेन क्रुएगर, इल्या सुत्स्केवर द्वारा।
 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (सेल्सफोर्स से) साथ में पेपर [प्रोग्राम सिंथेसिस के लिए एक संवादात्मक प्रतिमान](https://arxiv.org/abs/2203.13474) एरिक निजकैंप, बो पैंग, हिरोआकी हयाशी, लिफू तू, हुआन वांग, यिंगबो झोउ, सिल्वियो सावरेस, कैमिंग जिओंग रिलीज।

--- a/README_ja.md
+++ b/README_ja.md
@@ -322,6 +322,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
 1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (Inria/Facebook/Sorbonne から) Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot から公開された研究論文: [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894)
 1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (Google Research から) Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting から公開された研究論文: [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874)
 1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (OFA-Sys から) An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou から公開された研究論文: [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335)
+1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (LAION-AI から) Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. から公開された研究論文 [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation]https://arxiv.org/abs/2211.06687)
 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (OpenAI から) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever から公開された研究論文: [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)
 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (University of Göttingen から) Timo Lüddecke and Alexander Ecker から公開された研究論文: [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003)
 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (Salesforce から) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong から公開された研究論文: [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474)

--- a/README_ko.md
+++ b/README_ko.md
@@ -237,6 +237,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
 1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (Inria/Facebook/Sorbonne 에서) Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot 의 [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) 논문과 함께 발표했습니다.
 1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (Google Research 에서) Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting 의 [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) 논문과 함께 발표했습니다.
 1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (OFA-Sys 에서) An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou 의 [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) 논문과 함께 발표했습니다.
+1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (LAION-AI 에서 제공)은 Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.의 [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation]https://arxiv.org/abs/2211.06687)논문과 함께 발표했습니다.
 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (OpenAI 에서) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever 의 [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) 논문과 함께 발표했습니다.
 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (University of Göttingen 에서) Timo Lüddecke and Alexander Ecker 의 [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) 논문과 함께 발표했습니다.
 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (Salesforce 에서) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong 의 [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) 논문과 함께 발표했습니다.

--- a/README_zh-hans.md
+++ b/README_zh-hans.md
@@ -261,6 +261,7 @@ conda install -c huggingface transformers
 1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (来自 Inria/Facebook/Sorbonne) 伴随论文 [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) 由 Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot 发布。
 1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (来自 Google Research) 伴随论文 [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) 由 Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting 发布。
 1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (来自 OFA-Sys) 伴随论文 [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) 由 An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou 发布。
+1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (来自 LAION-AI) 伴随论文 [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation]https://arxiv.org/abs/2211.06687) 由 Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov 发布。
 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (来自 OpenAI) 伴随论文 [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) 由 Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever 发布。
 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (来自 University of Göttingen) 伴随论文 [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) 由 Timo Lüddecke and Alexander Ecker 发布。
 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (来自 Salesforce) 伴随论文 [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) 由 Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong 发布。

--- a/README_zh-hant.md
+++ b/README_zh-hant.md
@@ -273,6 +273,7 @@ conda install -c huggingface transformers
 1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
 1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
 1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
+1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation]https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.

--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -495,6 +495,8 @@
      sections:
      - local: model_doc/audio-spectrogram-transformer
        title: Audio Spectrogram Transformer
+      - local: model_doc/clap
+        title: CLAP
      - local: model_doc/hubert
        title: Hubert
      - local: model_doc/mctct
@@ -622,6 +624,8 @@
      title: Utilities for Generation
    - local: internal/image_processing_utils
      title: Utilities for Image Processors
+    - local: internal/audio_utils
+      title: Utilities for Audio processing
    - local: internal/file_utils
      title: General Utilities
    title: Internal Helpers

--- a/docs/source/en/index.mdx
+++ b/docs/source/en/index.mdx
@@ -74,6 +74,7 @@ The documentation is organized into five sections:
 1. **[CamemBERT](model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
 1. **[CANINE](model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
 1. **[Chinese-CLIP](model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
+1. **[CLAP](model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation]https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
 1. **[CLIP](model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
 1. **[CLIPSeg](model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
 1. **[CodeGen](model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
@@ -263,6 +264,7 @@ Flax), PyTorch, and/or TensorFlow.
 |           CamemBERT           |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
 |            CANINE             |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
 |         Chinese-CLIP          |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+|             CLAP              |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
 |             CLIP              |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
 |            CLIPSeg            |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
 |            CodeGen            |       ✅       |       ✅       |       ✅        |         ❌         |      ❌      |

--- a/docs/source/en/internal/audio_utils.mdx
+++ b/docs/source/en/internal/audio_utils.mdx
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+# Utilities for `FeatureExtractors`
+This page lists all the utility functions that can be used by the audio [`FeatureExtractor`] in order to compute special features from a raw audio using common algorithms such as *Short Time Fourier Transform* or *Mel log spectrogram*.
+Most of those are only useful if you are studying the code of the image processors in the library.
+## Audio Transformations
+[[autodoc]] audio_utils.hertz_to_mel
+[[autodoc]] audio_utils.mel_to_hertz
+[[autodoc]] audio_utils.get_mel_filter_banks
+[[autodoc]] audio_utils.stft
+[[autodoc]] audio_utils.power_to_db
+[[autodoc]] audio_utils.fram_wave
--- a/docs/source/en/model_doc/clap.mdx
+++ b/docs/source/en/model_doc/clap.mdx
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+# CLAP
+## Overview
+The CLAP model was proposed in [Large Scale Constrastive Laungaue-Audio pretraining with
+feature fusion and keyword-to-caption augmentation](https://arxiv.org/pdf/2211.06687.pdf) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
+CLAP (Constrastive Laungaue-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed in to predict the most relevant text snippet, given an audio, without directly optimizing for the task. The CLAP model uses a SWINTransformer to get audio features from a log-Mel spectrogram input, and a RoBERTa model to get text features. Both the text and audio features are then projected to a latent space with identical dimension. The dot product between the projected audio and text features is then used as a similar score.
+The abstract from the paper is the following:
+*Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and enhance the performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that our model achieves superior performance in text-to-audio retrieval task. In audio classification tasks, the model achieves state-of-the-art performance in the zeroshot setting and is able to obtain performance comparable to models' results in the non-zero-shot setting. LAION-Audio-6*
+This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://huggingface.co/ArtZucker) .
+The original code can be found [here](https://github.com/LAION-AI/Clap).
+## ClapConfig
+[[autodoc]] ClapConfig
+    - from_text_audio_configs
+## ClapTextConfig
+[[autodoc]] ClapTextConfig
+## ClapAudioConfig
+[[autodoc]] ClapAudioConfig
+## ClapFeatureExtractor
+[[autodoc]] ClapFeatureExtractor
+## ClapProcessor
+[[autodoc]] ClapProcessor
+## ClapModel
+[[autodoc]] ClapModel
+    - forward
+    - get_text_features
+    - get_audio_features
+## ClapTextModel
+[[autodoc]] ClapTextModel
+    - forward
+## ClapTextModelWithProjection
+[[autodoc]] ClapTextModelWithProjection
+    - forward
+## ClapAudioModel
+[[autodoc]] ClapAudioModel
+    - forward
+## ClapAudioModelWithProjection
+[[autodoc]] ClapAudioModelWithProjection
+    - forward
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -47,6 +47,7 @@ logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
 # Base objects, independent of any specific backend
 _import_structure = {
+    "audio_utils": [],
    "benchmark": [],
    "commands": [],
    "configuration_utils": ["PretrainedConfig"],
@@ -206,6 +207,13 @@ _import_structure = {
        "ChineseCLIPTextConfig",
        "ChineseCLIPVisionConfig",
    ],
+    "models.clap": [
+        "CLAP_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "ClapAudioConfig",
+        "ClapConfig",
+        "ClapProcessor",
+        "ClapTextConfig",
+    ],
    "models.clip": [
        "CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "CLIPConfig",
@@ -1231,6 +1239,18 @@ else:
            "ChineseCLIPVisionModel",
        ]
    )
+    _import_structure["models.clap"].extend(
+        [
+            "CLAP_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "ClapAudioModel",
+            "ClapAudioModelWithProjection",
+            "ClapFeatureExtractor",
+            "ClapModel",
+            "ClapPreTrainedModel",
+            "ClapTextModel",
+            "ClapTextModelWithProjection",
+        ]
+    )
    _import_structure["models.clip"].extend(
        [
            "CLIP_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -3730,6 +3750,13 @@ if TYPE_CHECKING:
        ChineseCLIPTextConfig,
        ChineseCLIPVisionConfig,
    )
+    from .models.clap import (
+        CLAP_PRETRAINED_MODEL_ARCHIVE_LIST,
+        ClapAudioConfig,
+        ClapConfig,
+        ClapProcessor,
+        ClapTextConfig,
+    )
    from .models.clip import (
        CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP,
        CLIPConfig,
@@ -4623,6 +4650,16 @@ if TYPE_CHECKING:
            ChineseCLIPTextModel,
            ChineseCLIPVisionModel,
        )
+        from .models.clap import (
+            CLAP_PRETRAINED_MODEL_ARCHIVE_LIST,
+            ClapAudioModel,
+            ClapAudioModelWithProjection,
+            ClapFeatureExtractor,
+            ClapModel,
+            ClapPreTrainedModel,
+            ClapTextModel,
+            ClapTextModelWithProjection,
+        )
        from .models.clip import (
            CLIP_PRETRAINED_MODEL_ARCHIVE_LIST,
            CLIPModel,

--- a/src/transformers/audio_utils.py
+++ b/src/transformers/audio_utils.py
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+ Audio processing functions to extract feature from a raw audio. Should all be in numpy to support all frameworks, and
+ remmove unecessary dependencies.
+"""
+import math
+import warnings
+from typing import Optional
+import numpy as np
+from numpy.fft import fft
+def hertz_to_mel(freq: float, mel_scale: str = "htk") -> float:
+    """Convert Hertz to Mels.
+    Args:
+        freqs (`float`):
+            Frequencies in Hertz
+        mel_scale (`str`, *optional*, defaults to `"htk"`):
+            Scale to use, `htk` or `slaney`.
+    Returns:
+        mels (`float`):
+            Frequency in Mels
+    """
+    if mel_scale not in ["slaney", "htk"]:
+        raise ValueError('mel_scale should be one of "htk" or "slaney".')
+    if mel_scale == "htk":
+        return 2595.0 * math.log10(1.0 + (freq / 700.0))
+    # Fill in the linear part
+    frequency_min = 0.0
+    f_sp = 200.0 / 3
+    mels = (freq - frequency_min) / f_sp
+    # Fill in the log-scale part
+    min_log_hertz = 1000.0
+    min_log_mel = (min_log_hertz - frequency_min) / f_sp
+    logstep = math.log(6.4) / 27.0
+    if freq >= min_log_hertz:
+        mels = min_log_mel + math.log(freq / min_log_hertz) / logstep
+    return mels
+def mel_to_hertz(mels: np.array, mel_scale: str = "htk") -> np.array:
+    """Convert mel bin numbers to frequencies.
+    Args:
+        mels (`np.array`):
+            Mel frequencies
+        mel_scale (`str`, *optional*, `"htk"`):
+            Scale to use: `htk` or `slaney`.
+    Returns:
+        freqs (`np.array`):
+            Mels converted to Hertz
+    """
+    if mel_scale not in ["slaney", "htk"]:
+        raise ValueError('mel_scale should be one of "htk" or "slaney".')
+    if mel_scale == "htk":
+        return 700.0 * (10.0 ** (mels / 2595.0) - 1.0)
+    # Fill in the linear scale
+    frequency_min = 0.0
+    f_sp = 200.0 / 3
+    freqs = frequency_min + f_sp * mels
+    # And now the nonlinear scale
+    min_log_hertz = 1000.0
+    min_log_mel = (min_log_hertz - frequency_min) / f_sp
+    logstep = math.log(6.4) / 27.0
+    log_t = mels >= min_log_mel
+    freqs[log_t] = min_log_hertz * np.exp(logstep * (mels[log_t] - min_log_mel))
+    return freqs
+def _create_triangular_filterbank(
+    all_freqs: np.array,
+    f_pts: np.array,
+) -> np.array:
+    """Create a triangular filter bank.
+    Args:
+        all_freqs (`np.array` of shape (`nb_frequency_bins`, )):
+            Discrete frequencies used when the STFT was computed.
+        f_pts (`np.array`, of shape (`nb_mel_filters`, )):
+            Coordinates of the middle points of the triangular filters to create.
+    Returns:
+        fb (np.array):
+            The filter bank of size (`nb_frequency_bins`, `nb_mel_filters`).
+    """
+    # Adapted from Librosa
+    # calculate the difference between each filter mid point and each stft freq point in hertz
+    f_diff = f_pts[1:] - f_pts[:-1]  # (n_filter + 1)
+    slopes = np.expand_dims(f_pts, 0) - np.expand_dims(all_freqs, 1)  # (nb_frequency_bins, n_filter + 2)
+    # create overlapping triangles
+    zero = np.zeros(1)
+    down_slopes = (-1.0 * slopes[:, :-2]) / f_diff[:-1]  # (nb_frequency_bins, n_filter)
+    up_slopes = slopes[:, 2:] / f_diff[1:]  # (nb_frequency_bins, n_filter)
+    fb = np.maximum(zero, np.minimum(down_slopes, up_slopes))
+    return fb
+def get_mel_filter_banks(
+    nb_frequency_bins: int,
+    nb_mel_filters: int,
+    frequency_min: float,
+    frequency_max: float,
+    sample_rate: int,
+    norm: Optional[str] = None,
+    mel_scale: str = "htk",
+) -> np.array:
+    """
+    Create a frequency bin conversion matrix used to obtain the Mel Spectrogram. This is called a *mel filter bank*,
+    and various implementation exist, which differ in the number of filters, the shape of the filters, the way the
+    filters are spaced, the bandwidth of the filters, and the manner in which the spectrum is warped. The goal of these
+    features is to approximate the non-linear human perception of the variation in pitch with respect to the frequency.
+    This code is heavily inspired from the *torchaudio* implementation, see
+    [here](https://pytorch.org/audio/stable/transforms.html) for more details.
+    Tips:
+        - Different banks of Mel filters were introduced in the litterature. The following variation are supported:
+            - MFCC FB-20: introduced in 1980 by Davis and Mermelstein, it assumes a sampling frequency of 10 kHertz
+            and a speech bandwidth of `[0, 4600]` Hertz
+            - MFCC FB-24 HTK: from the Cambridge HMM Toolkit (HTK) (1995) uses a filter bank of 24 filters for a
+            speech bandwidth `[0, 8000]` Hertz (sampling rate ≥ 16 kHertz).
+            - MFCC FB-40: from the Auditory Toolbox for MATLAB written by Slaney in 1998, assumes a sampling rate
+            of 16 kHertz, and speech bandwidth [133, 6854] Hertz. This version also includes an area normalization.
+            - HFCC-E FB-29 (Human Factor Cepstral Coefficients) of Skowronski and Harris (2004), assumes sampling
+            rate of 12.5 kHertz and speech bandwidth [0, 6250] Hertz
+        - The default parameters of `torchaudio`'s mel filterbanks implement the `"htk"` filers while `torchlibrosa`
+        uses the `"slaney"` implementation.
+    Args:
+        nb_frequency_bins (`int`):
+            Number of frequencies used to compute the spectrogram (should be the same as in `stft`).
+        nb_mel_filters (`int`):
+            Number of Mel filers to generate.
+        frequency_min (`float`):
+            Minimum frequency of interest(Hertz).
+        frequency_max (`float`):
+            Maximum frequency of interest(Hertz).
+        sample_rate (`int`):
+            Sample rate of the audio waveform.
+        norm (`str`, *optional*):
+            If "slaney", divide the triangular Mel weights by the width of the mel band (area normalization).
+        mel_scale (`str`, *optional*, defaults to `"htk"`):
+            Scale to use: `"htk"` or `"slaney"`.
+    Returns:
+        `np.ndarray`: Triangular filter banks (fb matrix) of shape (`nb_frequency_bins`, `nb_mel_filters`). This matrix
+        is a projection matrix to go from a spectrogram to a Mel Spectrogram.
+    """
+    if norm is not None and norm != "slaney":
+        raise ValueError('norm must be one of None or "slaney"')
+    # freqency bins
+    all_freqs = np.linspace(0, sample_rate // 2, nb_frequency_bins)
+    # Compute mim and max frequencies in mel scale
+    m_min = hertz_to_mel(frequency_min, mel_scale=mel_scale)
+    m_max = hertz_to_mel(frequency_max, mel_scale=mel_scale)
+    # create the centers of the triangular mel filters.
+    m_pts = np.linspace(m_min, m_max, nb_mel_filters + 2)
+    f_pts = mel_to_hertz(m_pts, mel_scale=mel_scale)
+    # create the filterbank
+    filterbank = _create_triangular_filterbank(all_freqs, f_pts)
+    if norm is not None and norm == "slaney":
+        # Slaney-style mel is scaled to be approx constant energy per channel
+        enorm = 2.0 / (f_pts[2 : nb_mel_filters + 2] - f_pts[:nb_mel_filters])
+        filterbank *= np.expand_dims(enorm, 0)
+    if (filterbank.max(axis=0) == 0.0).any():
+        warnings.warn(
+            "At least one mel filterbank has all zero values. "
+            f"The value for `nb_mel_filters` ({nb_mel_filters}) may be set too high. "
+            f"Or, the value for `nb_frequency_bins` ({nb_frequency_bins}) may be set too low."
+        )
+    return filterbank
+def power_to_db(mel_spectrogram, top_db=None, a_min=1e-10, ref=1.0):
+    """
+    Convert a mel spectrogram from power to db scale, this function is the numpy implementation of librosa.power_to_lb.
+    It computes `10 * log10(mel_spectrogram / ref)`, using basic log properties for stability.
+    Tips:
+        - The motivation behind applying the log function on the mel spectrogram is that humans do not hear loudness on
+          a
+        linear scale. Generally to double the percieved volume of a sound we need to put 8 times as much energy into
+        it.
+        - This means that large variations in energy may not sound all that different if the sound is loud to begin
+        with. This compression operation makes the mel features match more closely what humans actually hear.
+    Args:
+        mel_spectrogram (`np.array`):
+            Input mel spectrogram.
+        top_db (`int`, *optional*):
+            The maximum decibel value.
+        a_min (`int`, *optional*, default to 1e-10):
+            Minimum value to use when cliping the mel spectrogram.
+        ref (`float`, *optional*, default to 1.0):
+            Maximum reference value used to scale the mel_spectrogram.
+    """
+    log_spec = 10 * np.log10(np.clip(mel_spectrogram, a_min=a_min, a_max=None))
+    log_spec -= 10.0 * np.log10(np.maximum(a_min, ref))
+    if top_db is not None:
+        if top_db < 0:
+            raise ValueError("top_db must be non-negative")
+        log_spec = np.clip(log_spec, min=np.maximum(log_spec) - top_db, max=np.inf)
+    return log_spec
+# TODO @ArthurZucker: This method does not support batching yet as we are mainly focus on inference.
+def fram_wave(waveform: np.array, hop_length: int = 160, fft_window_size: int = 400, center: bool = True):
+    """
+    In order to compute the short time fourier transform, the waveform needs to be split in overlapping windowed
+    segments called `frames`.
+    The window length (window_length) defines how much of the signal is contained in each frame, while the hop length
+    defines the step between the beginning of each new frame.
+    Args:
+        waveform (`np.array` of shape `(sample_length,)`):
+            The raw waveform which will be split into smaller chunks.
+        hop_length (`int`, *optional*, defaults to 160):
+            Step between each window of the waveform.
+        fft_window_size (`int`, *optional*, defaults to 400):
+            Defines the size of the window.
+        center (`bool`, defaults to `True`):
+            Whether or not to center each frame around the middle of the frame. Centering is done by reflecting the
+            waveform on the left and on the right.
+    Return:
+        framed_waveform (`np.array` of shape `(waveform.shape // hop_length , fft_window_size)`):
+            The framed waveforms that can be fed to `np.fft`.
+    """
+    frames = []
+    for i in range(0, waveform.shape[0] + 1, hop_length):
+        if center:
+            half_window = (fft_window_size - 1) // 2 + 1
+            start = i - half_window if i > half_window else 0
+            end = i + half_window if i < waveform.shape[0] - half_window else waveform.shape[0]
+            frame = waveform[start:end]
+            if start == 0:
+                padd_width = (-i + half_window, 0)
+                frame = np.pad(frame, pad_width=padd_width, mode="reflect")
+            elif end == waveform.shape[0]:
+                padd_width = (0, (i - waveform.shape[0] + half_window))
+                frame = np.pad(frame, pad_width=padd_width, mode="reflect")
+        else:
+            frame = waveform[i : i + fft_window_size]
+            frame_width = frame.shape[0]
+            if frame_width < waveform.shape[0]:
+                frame = np.lib.pad(
+                    frame, pad_width=(0, fft_window_size - frame_width), mode="constant", constant_values=0
+                )
+        frames.append(frame)
+    frames = np.stack(frames, 0)
+    return frames
+# TODO @ArthurZucker: This method does not support batching yet as we are mainly focus on inference.
+def stft(frames: np.array, windowing_function: np.array, fft_window_size: int = None):
+    """
+    Calculates the complex Short-Time Fourier Transform (STFT) of the given framed signal. Should give the same results
+    as `torch.stft`.
+    Args:
+        frames (`np.array` of dimension `(num_frames, fft_window_size)`):
+            A framed audio signal obtained using `audio_utils.fram_wav`.
+        windowing_function (`np.array` of dimension `(nb_frequency_bins, nb_mel_filters)`:
+            A array reprensenting the function that will be used to reduces the amplitude of the discontinuities at the
+            boundaries of each frame when computing the STFT. Each frame will be multiplied by the windowing_function.
+            For more information on the discontinuities, called *Spectral leakage*, refer to [this
+            tutorial]https://download.ni.com/evaluation/pxi/Understanding%20FFTs%20and%20Windowing.pdf
+        fft_window_size (`int`, *optional*):
+            Size of the window om which the Fourier transform is applied. This controls the frequency resolution of the
+            spectrogram. 400 means that the fourrier transform is computed on windows of 400 samples. The number of
+            frequency bins (`nb_frequency_bins`) used to divide the window into equal strips is equal to
+            `(1+fft_window_size)//2`. An increase of the fft_window_size slows the calculus time proportionnally.
+    Example:
+    ```python
+    >>> from transformers.audio_utils import stft, fram_wave
+    >>> import numpy as np
+    >>> audio = np.random.rand(50)
+    >>> fft_window_size = 10
+    >>> hop_length = 2
+    >>> framed_audio = fram_wave(audio, hop_length, fft_window_size)
+    >>> spectrogram = stft(framed_audio, np.hanning(fft_window_size + 1))
+    ```
+    Returns:
+        spectrogram (`np.ndarray`):
+            A spectrogram of shape `(num_frames, nb_frequency_bins)` obtained using the STFT algorithm
+    """
+    frame_size = frames.shape[1]
+    if fft_window_size is None:
+        fft_window_size = frame_size
+    if fft_window_size < frame_size:
+        raise ValueError("FFT size must greater or equal the frame size")
+    # number of FFT bins to store
+    nb_frequency_bins = (fft_window_size >> 1) + 1
+    spectrogram = np.empty((len(frames), nb_frequency_bins), dtype=np.complex64)
+    fft_signal = np.zeros(fft_window_size)
+    for f, frame in enumerate(frames):
+        if windowing_function is not None:
+            np.multiply(frame, windowing_function, out=fft_signal[:frame_size])
+        else:
+            fft_signal[:frame_size] = frame
+        spectrogram[f] = fft(fft_signal, axis=0)[:nb_frequency_bins]
+    return spectrogram.T
--- a/src/transformers/feature_extraction_sequence_utils.py
+++ b/src/transformers/feature_extraction_sequence_utils.py
@@ -235,11 +235,13 @@ class SequenceFeatureExtractor(FeatureExtractionMixin):
        Pad inputs (on left/right and up to predefined length or max length in the batch)
        Args:
-            processed_features:
+            processed_features (`Union[Dict[str, np.ndarray], BatchFeature]`):
                Dictionary of input values (`np.ndarray[float]`) / input vectors (`List[np.ndarray[float]]`) or batch
                of inputs values (`List[np.ndarray[int]]`) / input vectors (`List[np.ndarray[int]]`)
-            max_length: maximum length of the returned list and optionally padding length (see below)
+            max_length (`int`, *optional*):
-            padding_strategy: PaddingStrategy to use for padding.
+                Maximum length of the returned list and optionally padding length (see below)
+            padding_strategy (`PaddingStrategy`, *optional*, default to `PaddingStrategy.DO_NOT_PAD`):
+                PaddingStrategy to use for padding.
                - PaddingStrategy.LONGEST Pad to the longest sequence in the batch
                - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
@@ -248,11 +250,12 @@ class SequenceFeatureExtractor(FeatureExtractionMixin):
                    - 'left': pads on the left of the sequences
                    - 'right': pads on the right of the sequences
-            pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
+            pad_to_multiple_of (`int`, *optional*):
-                This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
+                Integer if set will pad the sequence to a multiple of the provided value. This is especially useful to
-                `>= 7.5` (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
+                enable the use of Tensor Core on NVIDIA hardware with compute capability `>= 7.5` (Volta), or on TPUs
-            return_attention_mask:
+                which benefit from having sequence lengths be a multiple of 128.
-                (optional) Set to False to avoid returning attention mask (default: set to model specifics)
+            return_attention_mask (`bool`, *optional*):
+                Set to False to avoid returning attention mask (default: set to model specifics)
        """
        required_input = processed_features[self.model_input_names[0]]
@@ -303,15 +306,17 @@ class SequenceFeatureExtractor(FeatureExtractionMixin):
        Truncate inputs to predefined length or max length in the batch
        Args:
-            processed_features:
+            processed_features(`Union[Dict[str, np.ndarray], BatchFeature]`):
                Dictionary of input values (`np.ndarray[float]`) / input vectors (`List[np.ndarray[float]]`) or batch
                of inputs values (`List[np.ndarray[int]]`) / input vectors (`List[np.ndarray[int]]`)
-            max_length: maximum length of the returned list and optionally padding length (see below)
+            max_length (`int`, *optional*):
-            pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
+                maximum length of the returned list and optionally padding length (see below)
-                This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
+            pad_to_multiple_of (`int`, *optional*) :
-                `>= 7.5` (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
+                Integer if set will pad the sequence to a multiple of the provided value. This is especially useful to
-            truncation:
+                enable the use of Tensor Core on NVIDIA hardware with compute capability `>= 7.5` (Volta), or on TPUs
-                (optional) Activates truncation to cut input sequences longer than `max_length` to `max_length`.
+                which benefit from having sequence lengths be a multiple of 128.
+            truncation (`bool`, *optional*):
+                Activates truncation to cut input sequences longer than `max_length` to `max_length`.
        """
        if not truncation:
            return processed_features

--- a/src/transformers/models/__init__.py
+++ b/src/transformers/models/__init__.py
@@ -40,6 +40,7 @@ from . import (
    camembert,
    canine,
    chinese_clip,
+    clap,
    clip,
    clipseg,
    codegen,

--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -49,6 +49,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
        ("camembert", "CamembertConfig"),
        ("canine", "CanineConfig"),
        ("chinese_clip", "ChineseCLIPConfig"),
+        ("clap", "ClapConfig"),
        ("clip", "CLIPConfig"),
        ("clipseg", "CLIPSegConfig"),
        ("codegen", "CodeGenConfig"),
@@ -222,6 +223,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
        ("camembert", "CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("canine", "CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("chinese_clip", "CHINESE_CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
+        ("clap", "CLAP_PRETRAINED_MODEL_ARCHIVE_LIST"),
        ("clip", "CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("clipseg", "CLIPSEG_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("codegen", "CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -385,6 +387,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
        ("camembert", "CamemBERT"),
        ("canine", "CANINE"),
        ("chinese_clip", "Chinese-CLIP"),
+        ("clap", "CLAP"),
        ("clip", "CLIP"),
        ("clipseg", "CLIPSeg"),
        ("codegen", "CodeGen"),

--- a/src/transformers/models/auto/feature_extraction_auto.py
+++ b/src/transformers/models/auto/feature_extraction_auto.py
@@ -40,6 +40,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict(
        ("audio-spectrogram-transformer", "ASTFeatureExtractor"),
        ("beit", "BeitFeatureExtractor"),
        ("chinese_clip", "ChineseCLIPFeatureExtractor"),
+        ("clap", "ClapFeatureExtractor"),
        ("clip", "CLIPFeatureExtractor"),
        ("clipseg", "ViTFeatureExtractor"),
        ("conditional_detr", "ConditionalDetrFeatureExtractor"),

--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -47,6 +47,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
        ("camembert", "CamembertModel"),
        ("canine", "CanineModel"),
        ("chinese_clip", "ChineseCLIPModel"),
+        ("clap", "ClapModel"),
        ("clip", "CLIPModel"),
        ("clipseg", "CLIPSegModel"),
        ("codegen", "CodeGenModel"),

--- a/src/transformers/models/auto/processing_auto.py
+++ b/src/transformers/models/auto/processing_auto.py
@@ -46,6 +46,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
        ("blip-2", "Blip2Processor"),
        ("bridgetower", "BridgeTowerProcessor"),
        ("chinese_clip", "ChineseCLIPProcessor"),
+        ("clap", "ClapProcessor"),
        ("clip", "CLIPProcessor"),
        ("clipseg", "CLIPSegProcessor"),
        ("flava", "FlavaProcessor"),

--- a/src/transformers/models/auto/tokenization_auto.py
+++ b/src/transformers/models/auto/tokenization_auto.py
@@ -91,6 +91,13 @@ else:
            ),
            ("canine", ("CanineTokenizer", None)),
            ("chinese_clip", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
+            (
+                "clap",
+                (
+                    "RobertaTokenizer",
+                    "RobertaTokenizerFast" if is_tokenizers_available() else None,
+                ),
+            ),
            (
                "clip",
                (