Add CLVP (#24745)

* init commit
* attention arch done except rotary emb
* rotary emb done
* text encoder working
* outputs matching
* arch first pass done
* make commands done, tests and docs remaining
* all tests passed, only docs remaining
* docs done
* doc-builder fix
* convert script removed (not relevant)
* minor comments done
* added ckpt conversion script
* tokenizer done
* very minor fix of index.md 2
* mostly make fixup related
* all done except fe and rotary emb
* very small change
* removed unidecode dependency
* style changes
* tokenizer removed require_backends
* added require_inflect to tokenizer tests
* removed VOCAB_FILES in tokenizer test
* inflect dependency removed
* added rotary pos emb cache and simplified the apply method
* style
* little doc change
* more comments
* feature extractor added
* added processor
* auto-regressive config added
* added CLVPConditioningEncoder
* comments done except the test one
* weights added successfully (NOT tested)
* tokenizer fix with numbers
* generate outputs matching
* almost tests passing, Integ tests not written
* Integ tests added
* major CUDA error fixed
* docs done
* rebase and multiple fixes
* fixed rebase overwrites
* generate code simplified and tests for AutoRegressive model added
* minor changes
* refactored gpt2 code in clvp file
* weights done and all code refactored
* mostly done except the fast_tokenizer
* doc test fix
* config file's doc fixes
* more config fix
* more comments
* tokenizer comments mostly done
* modeling file mostly refactored and can load modules
* ClvpEncoder tested
* ClvpDecoder, ClvpModel and ClvpForCausalLM tested
* integration and all tests passed
* more fixes
* docs almost done
* ckpt conversion refactored
* style and some failing tests fix
* comments
* temporary output fix but test_assisted_decoding_matches_greedy_search test fails
* majority changes done
* use_cache outputs same now! Along with the asisted_greedy_decoding test fix
* more comments
* more comments
* prepare_inputs_for_generation fixed and _prepare_model_inputs added
* style fix
* clvp.md change
* moved clvpconditionalencoder norms
* add model to new index
* added tokenizer input_ids_with_special_tokens
* small fix
* config mostly done
* added config-tester and changed conversion script
* more comments
* comments
* style fix
* some comments
* tokenizer changed back to prev state
* small comments
* added output hidden states for the main model
* style fix
* comments
* small change
* revert small change
* .
* Update clvp.md
* Update test_modeling_clvp.py
* :)
* some minor change
* new fixes
* remove to_dict from FE

Commit 7e9f10ac authored Nov 10, 2023 by Susnato Dhar, committed by GitHub Nov 10, 2023
12 changed files
--- a/src/transformers/models/clvp/feature_extraction_clvp.py
+++ b/src/transformers/models/clvp/feature_extraction_clvp.py
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Feature extractor class for CLVP
+"""
+from typing import List, Optional, Union
+import numpy as np
+from ...audio_utils import mel_filter_bank, spectrogram, window_function
+from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
+from ...feature_extraction_utils import BatchFeature
+from ...utils import TensorType, logging
+logger = logging.get_logger(__name__)
+class ClvpFeatureExtractor(SequenceFeatureExtractor):
+    r"""
+    Constructs a CLVP feature extractor.
+    This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
+    most of the main methods. Users should refer to this superclass for more information regarding those methods.
+    This class extracts log-mel-spectrogram features from raw speech using a custom numpy implementation of the `Short
+    Time Fourier Transform` which should match pytorch's `torch.stft` equivalent.
+    Args:
+        feature_size (`int`, *optional*, defaults to 80):
+            The feature dimension of the extracted features.
+        sampling_rate (`int`, *optional*, defaults to 22050):
+            The sampling rate at which the audio files should be digitalized expressed in hertz (Hz).
+        default_audio_length (`int`, *optional*, defaults to 6):
+            The default length of raw audio in seconds. If `max_length` is not set during `__call__` then it will
+            automatically be set to default_audio_length * `self.sampling_rate`.
+        hop_length (`int`, *optional*, defaults to 256):
+            Length of the overlapping windows for the STFT used to obtain the Mel Frequency coefficients.
+        chunk_length (`int`, *optional*, defaults to 30):
+            The maximum number of chunks of `sampling_rate` samples used to trim and pad longer or shorter audio
+            sequences.
+        n_fft (`int`, *optional*, defaults to 1024):
+            Size of the Fourier transform.
+        padding_value (`float`, *optional*, defaults to 0.0):
+            Padding value used to pad the audio. Should correspond to silences.
+        mel_norms (`list` of length `feature_size`, *optional*):
+            If `mel_norms` is provided then it will be used to normalize the log-mel spectrograms along each
+            mel-filter.
+        return_attention_mask (`bool`, *optional*, defaults to `False`):
+            Whether to return the attention mask. This constructor default only applies when `__call__` receives
+            `return_attention_mask=None`; `__call__` itself defaults to `True`.
+            [What are attention masks?](../glossary#attention-mask)
+    """
+    model_input_names = ["input_features", "attention_mask"]
+    def __init__(
+        self,
+        feature_size=80,
+        sampling_rate=22050,
+        default_audio_length=6,
+        hop_length=256,
+        chunk_length=30,
+        n_fft=1024,
+        padding_value=0.0,
+        mel_norms=None,
+        return_attention_mask=False,  # pad inputs to max length with silence token (zero) and no attention mask
+        **kwargs,
+    ):
+        super().__init__(
+            feature_size=feature_size,
+            sampling_rate=sampling_rate,
+            padding_value=padding_value,
+            return_attention_mask=return_attention_mask,
+            **kwargs,
+        )
+        self.n_fft = n_fft
+        self.hop_length = hop_length
+        self.chunk_length = chunk_length
+        self.n_samples = chunk_length * sampling_rate
+        self.nb_max_frames = self.n_samples // hop_length
+        self.sampling_rate = sampling_rate
+        self.default_audio_length = default_audio_length
+        self.mel_norms = mel_norms
+        self.mel_filters = mel_filter_bank(
+            num_frequency_bins=1 + (n_fft // 2),
+            num_mel_filters=feature_size,
+            min_frequency=0.0,
+            max_frequency=8000.0,
+            sampling_rate=sampling_rate,
+            norm="slaney",
+            mel_scale="htk",
+        )
+    def _np_extract_fbank_features(self, waveform: np.ndarray) -> np.ndarray:
+        """
+        This method first computes the log-mel spectrogram of the provided audio, then applies normalization along
+        each mel-filterbank if `mel_norms` is provided.
+        """
+        log_spec = spectrogram(
+            waveform,
+            window_function(self.n_fft, "hann"),
+            frame_length=self.n_fft,
+            hop_length=self.hop_length,
+            power=2.0,
+            mel_filters=self.mel_filters,
+            log_mel=None,
+        )
+        log_spec = np.log(np.clip(log_spec, a_min=1e-5, a_max=None))
+        if self.mel_norms is not None:
+            log_spec = log_spec / np.array(self.mel_norms)[:, None]
+        return log_spec
+    def __call__(
+        self,
+        raw_speech: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]]],
+        sampling_rate: Optional[int] = None,
+        truncation: bool = True,
+        pad_to_multiple_of: Optional[int] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        return_attention_mask: Optional[bool] = True,
+        padding: Optional[str] = "max_length",
+        max_length: Optional[int] = None,
+        **kwargs,
+    ) -> BatchFeature:
+        """
+        `ClvpFeatureExtractor` is used to extract various voice specific properties such as the pitch and tone of the
+        voice, speaking speed, and even speaking defects like a lisp or stuttering from a sample voice or `raw_speech`.
+        First the voice is padded or truncated in a way such that it becomes a waveform of `self.default_audio_length`
+        seconds long and then the log-mel spectrogram is extracted from it.
+        Args:
+            raw_speech (`np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[List[float]]`):
+                The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float
+                values, a list of numpy arrays or a list of list of float values. Must be mono channel audio, not
+                stereo, i.e. single float per timestep.
+            sampling_rate (`int`, *optional*):
+                The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
+                `sampling_rate` when calling the feature extractor, since a mismatched rate otherwise goes
+                undetected.
+            truncation (`bool`, *optional*, defaults to `True`):
+                Activates truncation to cut input sequences longer than *max_length* to *max_length*.
+            pad_to_multiple_of (`int`, *optional*):
+                If set will pad the sequence to a multiple of the provided value.
+                This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
+                `>= 7.5` (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
+            return_attention_mask (`bool`, *optional*, defaults to `True`):
+                Whether to return the attention mask. If left to the default, it will return the attention mask.
+                [What are attention masks?](../glossary#attention-mask)
+            return_tensors (`str` or [`~utils.TensorType`], *optional*):
+                If set, will return tensors instead of list of python integers. Acceptable values are:
+                - `'tf'`: Return TensorFlow `tf.constant` objects.
+                - `'pt'`: Return PyTorch `torch.Tensor` objects.
+                - `'np'`: Return Numpy `np.ndarray` objects.
+            padding (`str`, *optional*, defaults to `"max_length"`):
+                The padding strategy forwarded to `self.pad`.
+            max_length (`int`, *optional*):
+                The maximum input length of the inputs.
+        """
+        if sampling_rate is not None:
+            if sampling_rate != self.sampling_rate:
+                raise ValueError(
+                    f"The model corresponding to this feature extractor: {self.__class__.__name__} was trained using a"
+                    f" sampling rate of {self.sampling_rate}. Please make sure that the provided `raw_speech` input"
+                    f" was sampled with {self.sampling_rate} and not {sampling_rate}."
+                )
+        else:
+            logger.warning(
+                "It is strongly recommended to pass the `sampling_rate` argument to this function. "
+                "Failing to do so can result in silent errors that might be hard to debug."
+            )
+        is_batched_numpy = isinstance(raw_speech, np.ndarray) and len(raw_speech.shape) > 1
+        if is_batched_numpy and len(raw_speech.shape) > 2:
+            raise ValueError(f"Only mono-channel audio is supported for input to {self}")
+        is_batched = is_batched_numpy or (
+            isinstance(raw_speech, (list, tuple)) and (isinstance(raw_speech[0], (np.ndarray, tuple, list)))
+        )
+        if is_batched:
+            raw_speech = [np.asarray([speech], dtype=np.float32).T for speech in raw_speech]
+        elif not is_batched and not isinstance(raw_speech, np.ndarray):
+            raw_speech = np.asarray(raw_speech, dtype=np.float32)
+        elif isinstance(raw_speech, np.ndarray) and raw_speech.dtype == np.dtype(np.float64):
+            raw_speech = raw_speech.astype(np.float32)
+        # always return batch
+        if not is_batched:
+            raw_speech = [np.asarray([raw_speech]).T]
+        batched_speech = BatchFeature({"input_features": raw_speech})
+        max_length = self.default_audio_length * self.sampling_rate if max_length is None else max_length
+        padded_inputs = self.pad(
+            batched_speech,
+            padding=padding,
+            max_length=max_length,
+            truncation=truncation,
+            pad_to_multiple_of=pad_to_multiple_of,
+            return_attention_mask=return_attention_mask,
+        )
+        # make sure list is in array format
+        input_features = padded_inputs.get("input_features").transpose(2, 0, 1)
+        input_features = [
+            self._np_extract_fbank_features(waveform).astype(np.float32) for waveform in input_features[0]
+        ]
+        if isinstance(input_features[0], List):
+            padded_inputs["input_features"] = [np.asarray(feature) for feature in input_features]
+        else:
+            padded_inputs["input_features"] = input_features
+        return padded_inputs.convert_to_tensors(return_tensors)
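The clipping, logarithm and per-filter normalization at the end of `_np_extract_fbank_features` can be tried in isolation. The following is a minimal sketch using only NumPy; the helper name `log_mel_normalize` and the toy input values are assumptions made for illustration and are not part of the commit.

```python
import numpy as np


def log_mel_normalize(mel_power, mel_norms=None, floor=1e-5):
    """Hypothetical helper mirroring the last steps of `_np_extract_fbank_features`.

    `mel_power` has shape (num_mel_filters, num_frames).
    """
    # Clip to a small floor so that silent frames do not produce log(0) = -inf.
    log_spec = np.log(np.clip(mel_power, a_min=floor, a_max=None))
    if mel_norms is not None:
        # One norm per mel filter, broadcast across the time axis.
        log_spec = log_spec / np.asarray(mel_norms)[:, None]
    return log_spec


# Toy power spectrogram: 2 mel filters x 2 frames (values chosen for easy logs).
mel_power = np.array([[0.0, 1.0], [np.e, np.e**2]])
out = log_mel_normalize(mel_power, mel_norms=[1.0, 2.0])
print(out.shape)  # (2, 2)
```

The zero entry is clipped to `1e-5` before the logarithm, and the second row is divided by its norm of `2.0`.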
--- a/src/transformers/models/clvp/modeling_clvp.py
+++ b/src/transformers/models/clvp/modeling_clvp.py
--- a/src/transformers/models/clvp/number_normalizer.py
+++ b/src/transformers/models/clvp/number_normalizer.py
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""English Normalizer class for CLVP."""
+import re
+class EnglishNormalizer:
+    def __init__(self):
+        # List of (regular expression, replacement) pairs for abbreviations:
+        self._abbreviations = [
+            (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+            for x in [
+                ("mrs", "misess"),
+                ("mr", "mister"),
+                ("dr", "doctor"),
+                ("st", "saint"),
+                ("co", "company"),
+                ("jr", "junior"),
+                ("maj", "major"),
+                ("gen", "general"),
+                ("drs", "doctors"),
+                ("rev", "reverend"),
+                ("lt", "lieutenant"),
+                ("hon", "honorable"),
+                ("sgt", "sergeant"),
+                ("capt", "captain"),
+                ("esq", "esquire"),
+                ("ltd", "limited"),
+                ("col", "colonel"),
+                ("ft", "fort"),
+            ]
+        ]
+        self.ones = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
+        self.teens = [
+            "ten",
+            "eleven",
+            "twelve",
+            "thirteen",
+            "fourteen",
+            "fifteen",
+            "sixteen",
+            "seventeen",
+            "eighteen",
+            "nineteen",
+        ]
+        self.tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
+    def number_to_words(self, num: int) -> str:
+        """
+        Converts numbers (`int`) to words (`str`).
+        Please note that it only supports numbers up to "'nine hundred ninety-nine quadrillion, nine hundred ninety-nine
+        trillion, nine hundred ninety-nine billion, nine hundred ninety-nine million, nine hundred ninety-nine
+        thousand, nine hundred ninety-nine'" or `number_to_words(999_999_999_999_999_999)`.
+        """
+        if num == 0:
+            return "zero"
+        elif num < 0:
+            return "minus " + self.number_to_words(abs(num))
+        elif num < 10:
+            return self.ones[num]
+        elif num < 20:
+            return self.teens[num - 10]
+        elif num < 100:
+            return self.tens[num // 10] + ("-" + self.number_to_words(num % 10) if num % 10 != 0 else "")
+        elif num < 1000:
+            return (
+                self.ones[num // 100] + " hundred" + (" " + self.number_to_words(num % 100) if num % 100 != 0 else "")
+            )
+        elif num < 1_000_000:
+            return (
+                self.number_to_words(num // 1000)
+                + " thousand"
+                + (", " + self.number_to_words(num % 1000) if num % 1000 != 0 else "")
+            )
+        elif num < 1_000_000_000:
+            return (
+                self.number_to_words(num // 1_000_000)
+                + " million"
+                + (", " + self.number_to_words(num % 1_000_000) if num % 1_000_000 != 0 else "")
+            )
+        elif num < 1_000_000_000_000:
+            return (
+                self.number_to_words(num // 1_000_000_000)
+                + " billion"
+                + (", " + self.number_to_words(num % 1_000_000_000) if num % 1_000_000_000 != 0 else "")
+            )
+        elif num < 1_000_000_000_000_000:
+            return (
+                self.number_to_words(num // 1_000_000_000_000)
+                + " trillion"
+                + (", " + self.number_to_words(num % 1_000_000_000_000) if num % 1_000_000_000_000 != 0 else "")
+            )
+        elif num < 1_000_000_000_000_000_000:
+            return (
+                self.number_to_words(num // 1_000_000_000_000_000)
+                + " quadrillion"
+                + (
+                    ", " + self.number_to_words(num % 1_000_000_000_000_000)
+                    if num % 1_000_000_000_000_000 != 0
+                    else ""
+                )
+            )
+        else:
+            return "number out of range"
+    def convert_to_ascii(self, text: str) -> str:
+        """
+        Converts unicode to ascii
+        """
+        return text.encode("ascii", "ignore").decode("utf-8")
+    def _expand_dollars(self, m: re.Match) -> str:
+        """
+        This method is used to expand numerical dollar values into spoken words.
+        """
+        match = m.group(1)
+        parts = match.split(".")
+        if len(parts) > 2:
+            return match + " dollars"  # Unexpected format
+        dollars = int(parts[0]) if parts[0] else 0
+        cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0
+        if dollars and cents:
+            dollar_unit = "dollar" if dollars == 1 else "dollars"
+            cent_unit = "cent" if cents == 1 else "cents"
+            return "%s %s, %s %s" % (dollars, dollar_unit, cents, cent_unit)
+        elif dollars:
+            dollar_unit = "dollar" if dollars == 1 else "dollars"
+            return "%s %s" % (dollars, dollar_unit)
+        elif cents:
+            cent_unit = "cent" if cents == 1 else "cents"
+            return "%s %s" % (cents, cent_unit)
+        else:
+            return "zero dollars"
+    def _remove_commas(self, m: re.Match) -> str:
+        """
+        This method is used to remove commas from numbers (e.g. `1,000` becomes `1000`).
+        """
+        return m.group(1).replace(",", "")
+    def _expand_decimal_point(self, m: re.Match) -> str:
+        """
+        This method is used to expand '.' into spoken word ' point '.
+        """
+        return m.group(1).replace(".", " point ")
+    def _expand_ordinal(self, num: re.Match) -> str:
+        """
+        This method is used to expand ordinals such as '1st', '2nd' into spoken words.
+        """
+        ordinal_suffixes = {1: "st", 2: "nd", 3: "rd"}
+        num = int(num.group(0)[:-2])
+        if 10 <= num % 100 and num % 100 <= 20:
+            suffix = "th"
+        else:
+            suffix = ordinal_suffixes.get(num % 10, "th")
+        return self.number_to_words(num) + suffix
+    def _expand_number(self, m: re.Match) -> str:
+        """
+        This method acts as a preprocessing step for numbers between 1000 and 3000 (same as the original repository,
+        link :
+        https://github.com/neonbjb/tortoise-tts/blob/4003544b6ff4b68c09856e04d3eff9da26d023c2/tortoise/utils/tokenizer.py#L86)
+        """
+        num = int(m.group(0))
+        if num > 1000 and num < 3000:
+            if num == 2000:
+                return "two thousand"
+            elif num > 2000 and num < 2010:
+                return "two thousand " + self.number_to_words(num % 100)
+            elif num % 100 == 0:
+                return self.number_to_words(num // 100) + " hundred"
+            else:
+                return self.number_to_words(num)
+        else:
+            return self.number_to_words(num)
+    def normalize_numbers(self, text: str) -> str:
+        """
+        This method is used to normalize numbers within a text such as converting the numbers to words, removing
+        commas, etc.
+        """
+        text = re.sub(re.compile(r"([0-9][0-9\,]+[0-9])"), self._remove_commas, text)
+        text = re.sub(re.compile(r"£([0-9\,]*[0-9]+)"), r"\1 pounds", text)
+        text = re.sub(re.compile(r"\$([0-9\.\,]*[0-9]+)"), self._expand_dollars, text)
+        text = re.sub(re.compile(r"([0-9]+\.[0-9]+)"), self._expand_decimal_point, text)
+        text = re.sub(re.compile(r"[0-9]+(st|nd|rd|th)"), self._expand_ordinal, text)
+        text = re.sub(re.compile(r"[0-9]+"), self._expand_number, text)
+        return text
+    def expand_abbreviations(self, text: str) -> str:
+        """
+        Expands abbreviated words.
+        """
+        for regex, replacement in self._abbreviations:
+            text = re.sub(regex, replacement, text)
+        return text
+    def collapse_whitespace(self, text: str) -> str:
+        """
+        Collapses runs of whitespace into a single space.
+        """
+        return re.sub(re.compile(r"\s+"), " ", text)
+    def __call__(self, text):
+        """
+        Converts text to ascii, numbers / number-like quantities to their spelt-out counterparts and expands
+        abbreviations
+        """
+        text = self.convert_to_ascii(text)
+        text = text.lower()
+        text = self.normalize_numbers(text)
+        text = self.expand_abbreviations(text)
+        text = self.collapse_whitespace(text)
+        text = text.replace('"', "")
+        return text
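To make the order of the substitutions in `normalize_numbers` concrete, here is a reduced, self-contained sketch covering only comma removal and plain integer expansion below one million. The module-level names (`ONES`, `TEENS`, `TENS`, and the free functions) are assumptions made for this illustration; the sketch deliberately omits the currency, decimal, ordinal and 1000-3000 special cases of `EnglishNormalizer`.

```python
import re

ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def number_to_words(num: int) -> str:
    """Spell out 0 <= num < 1_000_000 using the same recursion as the diff."""
    if num == 0:
        return "zero"
    if num < 10:
        return ONES[num]
    if num < 20:
        return TEENS[num - 10]
    if num < 100:
        return TENS[num // 10] + ("-" + ONES[num % 10] if num % 10 else "")
    if num < 1000:
        rest = num % 100
        return ONES[num // 100] + " hundred" + (" " + number_to_words(rest) if rest else "")
    if num < 1_000_000:
        rest = num % 1000
        return number_to_words(num // 1000) + " thousand" + (", " + number_to_words(rest) if rest else "")
    return "number out of range"


def normalize_numbers(text: str) -> str:
    # Step 1: strip thousands separators so "1,250" becomes a single digit run.
    text = re.sub(r"([0-9][0-9,]+[0-9])", lambda m: m.group(1).replace(",", ""), text)
    # Step 2: spell out every remaining digit run.
    return re.sub(r"[0-9]+", lambda m: number_to_words(int(m.group(0))), text)


result = normalize_numbers("i paid 1,250 for 3 chairs")
print(result)  # i paid one thousand, two hundred fifty for three chairs
```

Removing commas first matters: without that step, `1,250` would be read as the two separate numbers `1` and `250`.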
--- a/src/transformers/models/clvp/processing_clvp.py
+++ b/src/transformers/models/clvp/processing_clvp.py
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Processor class for CLVP
+"""
+from ...processing_utils import ProcessorMixin
+class ClvpProcessor(ProcessorMixin):
+    r"""
+    Constructs a CLVP processor which wraps a CLVP Feature Extractor and a CLVP Tokenizer into a single processor.
+    [`ClvpProcessor`] offers all the functionalities of [`ClvpFeatureExtractor`] and [`ClvpTokenizer`]. See the
+    [`~ClvpProcessor.__call__`], [`~ClvpProcessor.decode`] and [`~ClvpProcessor.batch_decode`] for more information.
+    Args:
+        feature_extractor (`ClvpFeatureExtractor`):
+            An instance of [`ClvpFeatureExtractor`]. The feature extractor is a required input.
+        tokenizer (`ClvpTokenizer`):
+            An instance of [`ClvpTokenizer`]. The tokenizer is a required input.
+    """
+    feature_extractor_class = "ClvpFeatureExtractor"
+    tokenizer_class = "ClvpTokenizer"
+    model_input_names = [
+        "input_ids",
+        "input_features",
+        "attention_mask",
+    ]
+    def __init__(self, feature_extractor, tokenizer):
+        super().__init__(feature_extractor, tokenizer)
+    def __call__(self, *args, **kwargs):
+        """
+        Forwards the `raw_speech` and `sampling_rate` arguments to [`~ClvpFeatureExtractor.__call__`] and the `text`
+        argument to [`~ClvpTokenizer.__call__`]. Please refer to the docstrings of these two methods for more
+        information.
+        """
+        raw_speech = kwargs.pop("raw_speech", None)
+        sampling_rate = kwargs.pop("sampling_rate", None)
+        text = kwargs.pop("text", None)
+        if raw_speech is None and text is None:
+            raise ValueError("You need to specify either a `raw_speech` or a `text` input to process.")
+        if raw_speech is not None:
+            inputs = self.feature_extractor(raw_speech, sampling_rate=sampling_rate, **kwargs)
+        if text is not None:
+            encodings = self.tokenizer(text, **kwargs)
+        if text is None:
+            return inputs
+        elif raw_speech is None:
+            return encodings
+        else:
+            inputs["input_ids"] = encodings["input_ids"]
+            inputs["attention_mask"] = encodings["attention_mask"]
+            return inputs
+    # Copied from transformers.models.whisper.processing_whisper.WhisperProcessor.batch_decode with Whisper->Clvp
+    def batch_decode(self, *args, **kwargs):
+        """
+        This method forwards all its arguments to ClvpTokenizer's [`~PreTrainedTokenizer.batch_decode`]. Please refer
+        to the docstring of this method for more information.
+        """
+        return self.tokenizer.batch_decode(*args, **kwargs)
+    # Copied from transformers.models.whisper.processing_whisper.WhisperProcessor.decode with Whisper->Clvp
+    def decode(self, *args, **kwargs):
+        """
+        This method forwards all its arguments to ClvpTokenizer's [`~PreTrainedTokenizer.decode`]. Please refer to the
+        docstring of this method for more information.
+        """
+        return self.tokenizer.decode(*args, **kwargs)
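The branching in `ClvpProcessor.__call__` decides which outputs are returned and how they are merged. Below is a minimal sketch of that control flow with stand-in callables; `process`, `fake_feature_extractor` and `fake_tokenizer` are hypothetical names invented for this illustration and do not reproduce the real classes' outputs.

```python
def process(raw_speech=None, text=None, *, feature_extractor, tokenizer):
    """Hypothetical stand-alone version of the merging logic in `ClvpProcessor.__call__`."""
    if raw_speech is None and text is None:
        raise ValueError("You need to specify either a `raw_speech` or a `text` input to process.")
    inputs = feature_extractor(raw_speech) if raw_speech is not None else None
    encodings = tokenizer(text) if text is not None else None
    if text is None:
        return inputs
    if raw_speech is None:
        return encodings
    # Both given: the tokenizer's ids and mask are written into the audio features dict.
    inputs["input_ids"] = encodings["input_ids"]
    inputs["attention_mask"] = encodings["attention_mask"]
    return inputs


def fake_feature_extractor(raw_speech):
    # Stand-in: one "feature" and one mask entry per audio sample.
    return {"input_features": [list(raw_speech)], "attention_mask": [[1] * len(raw_speech)]}


def fake_tokenizer(text):
    # Stand-in: one id per character.
    ids = [ord(c) for c in text]
    return {"input_ids": [ids], "attention_mask": [[1] * len(ids)]}


both = process(
    raw_speech=[0.1, 0.2, 0.3],
    text="hi",
    feature_extractor=fake_feature_extractor,
    tokenizer=fake_tokenizer,
)
print(sorted(both))            # ['attention_mask', 'input_features', 'input_ids']
print(both["attention_mask"])  # [[1, 1]]
```

When both inputs are given, the `attention_mask` in the result is the tokenizer's mask: it replaces any mask the feature extractor produced for the audio.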
--- a/src/transformers/models/clvp/tokenization_clvp.py
+++ b/src/transformers/models/clvp/tokenization_clvp.py
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tokenization class for CLVP."""
+import json
+import os
+from functools import lru_cache
+from typing import List, Optional, Tuple
+import regex as re
+from ...tokenization_utils import AddedToken, PreTrainedTokenizer
+from ...utils import logging
+from .number_normalizer import EnglishNormalizer
+logger = logging.get_logger(__name__)
+VOCAB_FILES_NAMES = {
+    "vocab_file": "vocab.json",
+    "merges_file": "merges.txt",
+}
+PRETRAINED_VOCAB_FILES_MAP = {
+    "vocab_file": {
+        "clvp_dev": "https://huggingface.co/susnato/clvp_dev/blob/main/vocab.json",
+    },
+    "merges_file": {
+        "clvp_dev": "https://huggingface.co/susnato/clvp_dev/blob/main/merges.txt",
+    },
+}
+PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
+    "clvp_dev": 1024,
+}
+@lru_cache()
+# Copied from transformers.models.gpt2.tokenization_gpt2.bytes_to_unicode
+def bytes_to_unicode():
+    """
+    Returns list of utf-8 byte and a mapping to unicode strings. We specifically avoids mapping to whitespace/control
+    characters the bpe code barfs on.
+    The reversible bpe codes work on unicode strings. This means you need a large # of unicode characters in your vocab
+    if you want to avoid UNKs. When you're at something like a 10B token dataset you end up needing around 5K for
+    decent coverage. This is a significant percentage of your normal, say, 32K bpe vocab. To avoid that, we want lookup
+    tables between utf-8 bytes and unicode strings.
+    """
+    bs = (
+        list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
+    )
+    cs = bs[:]
+    n = 0
+    for b in range(2**8):
+        if b not in bs:
+            bs.append(b)
+            cs.append(2**8 + n)
+            n += 1
+    cs = [chr(n) for n in cs]
+    return dict(zip(bs, cs))
+# Copied from transformers.models.gpt2.tokenization_gpt2.get_pairs
+def get_pairs(word):
+    """
+    Return set of symbol pairs in a word.
+    Word is represented as tuple of symbols (symbols being variable-length strings).
+    """
+    pairs = set()
+    prev_char = word[0]
+    for char in word[1:]:
+        pairs.add((prev_char, char))
+        prev_char = char
+    return pairs
+class ClvpTokenizer(PreTrainedTokenizer):
+    """
+    Construct a CLVP tokenizer. Based on byte-level Byte-Pair-Encoding.
+    This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
+    be encoded differently whether it is at the beginning of the sentence (without space) or not:
+    ```python
+    >>> from transformers import ClvpTokenizer
+    >>> tokenizer = ClvpTokenizer.from_pretrained("susnato/clvp_dev")
+    >>> tokenizer("Hello world")["input_ids"]
+    [62, 84, 28, 2, 179, 79]
+    >>> tokenizer(" Hello world")["input_ids"]
+    [2, 62, 84, 28, 2, 179, 79]
+    ```
+    You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you
+    call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
+    <Tip>
+    When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one).
+    </Tip>
+    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
+    this superclass for more information regarding those methods.
+    Args:
+        vocab_file (`str`):
+            Path to the vocabulary file.
+        merges_file (`str`):
+            Path to the merges file.
+        errors (`str`, *optional*, defaults to `"replace"`):
+            Paradigm to follow when decoding bytes to UTF-8. See
+            [bytes.decode](https://docs.python.org/3/library/stdtypes.html#bytes.decode) for more information.
+        unk_token (`str`, *optional*, defaults to `"[UNK]"`):
+            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
+            token instead.
+        bos_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
+            The beginning of sequence token.
+        eos_token (`str`, *optional*, defaults to `"[STOP]"`):
+            The end of sequence token.
+        pad_token (`str`, *optional*, defaults to `"[STOP]"`):
+            The pad token of the sequence.
+        add_prefix_space (`bool`, *optional*, defaults to `False`):
+            Whether or not to add an initial space to the input. This allows the leading word to be treated just like
+            any other word. (The CLVP tokenizer detects the beginning of words by the preceding space).
+        add_bos_token (`bool`, *optional*, defaults to `False`):
+            Whether to add `bos_token` in front of the sequence when `add_special_tokens=True`.
+        add_eos_token (`bool`, *optional*, defaults to `False`):
+            Whether to add `eos_token` at the end of the sequence when `add_special_tokens=True`.
+    """
+    vocab_files_names = VOCAB_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+    model_input_names = [
+        "input_ids",
+        "attention_mask",
+    ]
+    def __init__(
+        self,
+        vocab_file,
+        merges_file,
+        errors="replace",
+        unk_token="[UNK]",
+        bos_token="<|endoftext|>",
+        eos_token="[STOP]",
+        pad_token="[STOP]",
+        add_prefix_space=False,
+        add_bos_token=False,
+        add_eos_token=False,
+        **kwargs,
+    ):
+        bos_token = AddedToken(bos_token, special=True) if isinstance(bos_token, str) else bos_token
+        eos_token = AddedToken(eos_token, special=True) if isinstance(eos_token, str) else eos_token
+        unk_token = AddedToken(unk_token, special=True) if isinstance(unk_token, str) else unk_token
+        pad_token = AddedToken(pad_token, special=True) if isinstance(pad_token, str) else pad_token
+        self.add_bos_token = add_bos_token
+        self.add_eos_token = add_eos_token
+        self._normalizer = None
+        with open(vocab_file, encoding="utf-8") as vocab_handle:
+            self.encoder = json.load(vocab_handle)
+        self.decoder = {v: k for k, v in self.encoder.items()}
+        self.errors = errors  # how to handle errors in decoding
+        self.byte_encoder = bytes_to_unicode()
+        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
+        with open(merges_file, encoding="utf-8") as merges_handle:
+            bpe_merges = merges_handle.read().split("\n")[1:-1]
+        bpe_merges = [tuple(merge.split()) for merge in bpe_merges]
+        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
+        self.cache = {}
+        self.add_prefix_space = add_prefix_space
+        # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
+        self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
+        super().__init__(
+            errors=errors,
+            unk_token=unk_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            pad_token=pad_token,
+            add_prefix_space=add_prefix_space,
+            add_bos_token=add_bos_token,
+            add_eos_token=add_eos_token,
+            **kwargs,
+        )
+    @property
+    def vocab_size(self):
+        return len(self.encoder)
+    @property
+    def normalizer(self):
+        if self._normalizer is None:
+            self._normalizer = EnglishNormalizer()
+        return self._normalizer
+    def get_vocab(self):
+        return dict(self.encoder, **self.added_tokens_encoder)
+    # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.bpe
+    def bpe(self, token):
+        if token in self.cache:
+            return self.cache[token]
+        word = tuple(token)
+        pairs = get_pairs(word)
+        if not pairs:
+            return token
+        while True:
+            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
+            if bigram not in self.bpe_ranks:
+                break
+            first, second = bigram
+            new_word = []
+            i = 0
+            while i < len(word):
+                try:
+                    j = word.index(first, i)
+                except ValueError:
+                    new_word.extend(word[i:])
+                    break
+                else:
+                    new_word.extend(word[i:j])
+                    i = j
+                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
+                    new_word.append(first + second)
+                    i += 2
+                else:
+                    new_word.append(word[i])
+                    i += 1
+            new_word = tuple(new_word)
+            word = new_word
+            if len(word) == 1:
+                break
+            else:
+                pairs = get_pairs(word)
+        word = " ".join(word)
+        self.cache[token] = word
+        return word
+    # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.build_inputs_with_special_tokens
+    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
+        bos_token_id = [self.bos_token_id] if self.add_bos_token else []
+        eos_token_id = [self.eos_token_id] if self.add_eos_token else []
+        output = bos_token_id + token_ids_0 + eos_token_id
+        if token_ids_1 is not None:
+            output = output + bos_token_id + token_ids_1 + eos_token_id
+        return output
+    # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.get_special_tokens_mask
+    def get_special_tokens_mask(
+        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
+    ) -> List[int]:
+        """
+        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
+        special tokens using the tokenizer `prepare_for_model` or `encode_plus` methods.
+        Args:
+            token_ids_0 (`List[int]`):
+                List of IDs.
+            token_ids_1 (`List[int]`, *optional*):
+                Optional second list of IDs for sequence pairs.
+            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
+                Whether or not the token list is already formatted with special tokens for the model.
+        Returns:
+            `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
+        """
+        if already_has_special_tokens:
+            return super().get_special_tokens_mask(
+                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
+            )
+        if not self.add_bos_token:
+            return super().get_special_tokens_mask(
+                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=False
+            )
+        if token_ids_1 is None:
+            return [1] + ([0] * len(token_ids_0))
+        return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1))
+    def _tokenize(self, text):
+        """Tokenize a string."""
+        bpe_tokens = []
+        text = self.normalizer(text)
+        for token in re.findall(self.pat, text):
+            token = "".join(
+                self.byte_encoder[b] for b in token.encode("utf-8")
+            )  # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case)
+            # if the token is "Ġ" we replace it with "[SPACE]" (if "[SPACE]" is present in the vocab), otherwise we keep the "Ġ".
+            bpe_tokens.extend(
+                "[SPACE]" if bpe_token == "\u0120" and "[SPACE]" in self.encoder.keys() else bpe_token
+                for bpe_token in self.bpe(token).split(" ")
+            )
+        return bpe_tokens
+    # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer._convert_token_to_id
+    def _convert_token_to_id(self, token):
+        """Converts a token (str) in an id using the vocab."""
+        return self.encoder.get(token, self.encoder.get(self.unk_token))
+    # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer._convert_id_to_token
+    def _convert_id_to_token(self, index):
+        """Converts an index (integer) in a token (str) using the vocab."""
+        return self.decoder.get(index)
+    # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.convert_tokens_to_string
+    def convert_tokens_to_string(self, tokens):
+        """Converts a sequence of tokens (string) in a single string."""
+        text = "".join(tokens)
+        text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)
+        return text
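+    def clean_up_tokenization(self, text):
+        """Maps `[SPACE]` and `[STOP]` to spaces, drops the unknown token and collapses runs of two or three spaces."""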
+        text = "".join(text)
+        vocab_tokens = list(self.encoder.keys()) + list(self.added_tokens_encoder.keys())
+        text = text.replace("[SPACE]", " ") if "[SPACE]" in vocab_tokens else text
+        text = text.replace("[STOP]", " ") if "[STOP]" in vocab_tokens else text
+        text = text.replace(self.unk_token, "").replace("   ", " ").replace("  ", " ")
+        return text
+    # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.save_vocabulary
+    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
+        if not os.path.isdir(save_directory):
+            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
+            return
+        vocab_file = os.path.join(
+            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
+        )
+        merge_file = os.path.join(
+            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["merges_file"]
+        )
+        with open(vocab_file, "w", encoding="utf-8") as f:
+            f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n")
+        index = 0
+        with open(merge_file, "w", encoding="utf-8") as writer:
+            writer.write("#version: 0.2\n")
+            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):
+                if index != token_index:
+                    logger.warning(
+                        f"Saving vocabulary to {merge_file}: BPE merge indices are not consecutive."
+                        " Please check that the tokenizer is not corrupted!"
+                    )
+                    index = token_index
+                writer.write(" ".join(bpe_tokens) + "\n")
+                index += 1
+        return vocab_file, merge_file
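The byte-level BPE path used by `ClvpTokenizer._tokenize` can be followed in isolation with the minimal sketch below. It re-implements `bytes_to_unicode` and `get_pairs` from the file above in condensed form. The `toy_bpe` helper and the two-entry `toy_ranks` merge table are hypothetical stand-ins for `ClvpTokenizer.bpe` and the real `merges.txt`, chosen only to make the merge order visible.

```python
def bytes_to_unicode():
    # Printable bytes map to themselves; every other byte is shifted past 255.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


def get_pairs(word):
    # Set of adjacent symbol pairs in a tuple of symbols.
    return set(zip(word, word[1:]))


def toy_bpe(token, ranks):
    # Hypothetical condensed merge loop: repeatedly merge the lowest-ranked pair.
    word = tuple(token)
    while len(word) > 1:
        best = min(get_pairs(word), key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break
        first, second = best
        merged, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
                merged.append(first + second)
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = tuple(merged)
    return list(word)


byte_encoder = bytes_to_unicode()
# The space byte (32) becomes "\u0120" (rendered as "Ġ"), the symbol `_tokenize` swaps for "[SPACE]".
encoded = "".join(byte_encoder[b] for b in " hi".encode("utf-8"))
# Assumed merge table for illustration only; the real ranks come from merges.txt.
toy_ranks = {("h", "i"): 0, ("\u0120", "hi"): 1}
pieces = toy_bpe(encoded, toy_ranks)
print(pieces)
```

With this toy table the pair `("h", "i")` merges first, then the space marker joins the result, so the whole word collapses into a single symbol. With the real merges the split depends entirely on the ranks stored in `merges.txt`.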
--- a/src/transformers/utils/dummy_pt_objects.py
+++ b/src/transformers/utils/dummy_pt_objects.py
@@ -1940,6 +1940,51 @@ class CLIPSegVisionModel(metaclass=DummyObject):
        requires_backends(self, ["torch"])
+CLVP_PRETRAINED_MODEL_ARCHIVE_LIST = None
+class ClvpDecoder(metaclass=DummyObject):
+    _backends = ["torch"]
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+class ClvpEncoder(metaclass=DummyObject):
+    _backends = ["torch"]
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+class ClvpForCausalLM(metaclass=DummyObject):
+    _backends = ["torch"]
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+class ClvpModel(metaclass=DummyObject):
+    _backends = ["torch"]
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+class ClvpModelForConditionalGeneration(metaclass=DummyObject):
+    _backends = ["torch"]
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+class ClvpPreTrainedModel(metaclass=DummyObject):
+    _backends = ["torch"]
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
 CODEGEN_PRETRAINED_MODEL_ARCHIVE_LIST = None
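The dummy classes above exist so that importing a CLVP model name without PyTorch installed fails with a clear message instead of a bare `ImportError` at import time. The sketch below is a simplified stand-in for that pattern, not the library's implementation: `DummyObjectSketch`, `requires_backends_sketch`, `ClvpModelPlaceholder` and the backend name are all hypothetical.

```python
import importlib.util


def requires_backends_sketch(obj, backends):
    # Hypothetical simplified check: raise if any named backend cannot be found.
    name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
    missing = [b for b in backends if importlib.util.find_spec(b) is None]
    if missing:
        raise ImportError(f"{name} requires the following backends: {', '.join(missing)}")


class DummyObjectSketch(type):
    # Metaclass: any public class attribute access triggers the backend check.
    def __getattribute__(cls, key):
        if key.startswith("_"):
            return super().__getattribute__(key)
        requires_backends_sketch(cls, cls._backends)
        return super().__getattribute__(key)


class ClvpModelPlaceholder(metaclass=DummyObjectSketch):
    # Assumed backend name that is not installed, to force the error path.
    _backends = ["a_backend_that_is_not_installed"]

    def __init__(self, *args, **kwargs):
        requires_backends_sketch(self, self._backends)


try:
    ClvpModelPlaceholder()
    error_message = None
except ImportError as err:
    error_message = str(err)
print(error_message)
```

Both instantiation and class-level access such as `ClvpModelPlaceholder.from_pretrained` raise in this sketch, which is the behaviour the placeholder pattern is meant to provide.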

--- a/tests/models/clvp/__init__.py
+++ b/tests/models/clvp/__init__.py
--- a/tests/models/clvp/test_feature_extraction_clvp.py
+++ b/tests/models/clvp/test_feature_extraction_clvp.py
+# coding=utf-8
+# Copyright 2023 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import gc
+import itertools
+import os
+import random
+import tempfile
+import unittest
+import numpy as np
+from datasets import Audio, load_dataset
+from transformers import ClvpFeatureExtractor
+from transformers.testing_utils import check_json_file_has_correct_format, require_torch, slow
+from transformers.utils.import_utils import is_torch_available
+from ...test_sequence_feature_extraction_common import SequenceFeatureExtractionTestMixin
+if is_torch_available():
+    import torch
+global_rng = random.Random()
+# Copied from transformers.tests.models.whisper.test_feature_extraction_whisper.floats_list
+def floats_list(shape, scale=1.0, rng=None, name=None):
+    """Creates a nested list of random floats with the given 2-D shape"""
+    if rng is None:
+        rng = global_rng
+    values = []
+    for batch_idx in range(shape[0]):
+        values.append([])
+        for _ in range(shape[1]):
+            values[-1].append(rng.random() * scale)
+    return values
+@require_torch
+class ClvpFeatureExtractionTester(unittest.TestCase):
+    def __init__(
+        self,
+        parent,
+        batch_size=7,
+        min_seq_length=400,
+        max_seq_length=2000,
+        feature_size=10,
+        hop_length=160,
+        chunk_length=8,
+        padding_value=0.0,
+        sampling_rate=4_000,
+        return_attention_mask=False,
+    ):
+        self.parent = parent
+        self.batch_size = batch_size
+        self.min_seq_length = min_seq_length
+        self.max_seq_length = max_seq_length
+        self.seq_length_diff = (self.max_seq_length - self.min_seq_length) // (self.batch_size - 1)
+        self.padding_value = padding_value
+        self.sampling_rate = sampling_rate
+        self.return_attention_mask = return_attention_mask
+        self.feature_size = feature_size
+        self.chunk_length = chunk_length
+        self.hop_length = hop_length
+    def prepare_feat_extract_dict(self):
+        return {
+            "feature_size": self.feature_size,
+            "hop_length": self.hop_length,
+            "chunk_length": self.chunk_length,
+            "padding_value": self.padding_value,
+            "sampling_rate": self.sampling_rate,
+            "return_attention_mask": self.return_attention_mask,
+        }
+    # Copied from transformers.tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTester.prepare_inputs_for_common
+    def prepare_inputs_for_common(self, equal_length=False, numpify=False):
+        def _flatten(list_of_lists):
+            return list(itertools.chain(*list_of_lists))
+        if equal_length:
+            speech_inputs = [floats_list((self.max_seq_length, self.feature_size)) for _ in range(self.batch_size)]
+        else:
+            # make sure that inputs increase in size
+            speech_inputs = [
+                floats_list((x, self.feature_size))
+                for x in range(self.min_seq_length, self.max_seq_length, self.seq_length_diff)
+            ]
+        if numpify:
+            speech_inputs = [np.asarray(x) for x in speech_inputs]
+        return speech_inputs
+@require_torch
+class ClvpFeatureExtractionTest(SequenceFeatureExtractionTestMixin, unittest.TestCase):
+    feature_extraction_class = ClvpFeatureExtractor
+    def setUp(self):
+        self.feat_extract_tester = ClvpFeatureExtractionTester(self)
+    def tearDown(self):
+        super().tearDown()
+        # clean up as much GPU memory occupied by PyTorch as possible
+        gc.collect()
+        torch.cuda.empty_cache()
+    # Copied from transformers.tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTest.test_feat_extract_from_and_save_pretrained
+    def test_feat_extract_from_and_save_pretrained(self):
+        feat_extract_first = self.feature_extraction_class(**self.feat_extract_dict)
+        with tempfile.TemporaryDirectory() as tmpdirname:
+            saved_file = feat_extract_first.save_pretrained(tmpdirname)[0]
+            check_json_file_has_correct_format(saved_file)
+            feat_extract_second = self.feature_extraction_class.from_pretrained(tmpdirname)
+        dict_first = feat_extract_first.to_dict()
+        dict_second = feat_extract_second.to_dict()
+        mel_1 = feat_extract_first.mel_filters
+        mel_2 = feat_extract_second.mel_filters
+        self.assertTrue(np.allclose(mel_1, mel_2))
+        self.assertEqual(dict_first, dict_second)
+    # Copied from transformers.tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTest.test_feat_extract_to_json_file
+    def test_feat_extract_to_json_file(self):
+        feat_extract_first = self.feature_extraction_class(**self.feat_extract_dict)
+        with tempfile.TemporaryDirectory() as tmpdirname:
+            json_file_path = os.path.join(tmpdirname, "feat_extract.json")
+            feat_extract_first.to_json_file(json_file_path)
+            feat_extract_second = self.feature_extraction_class.from_json_file(json_file_path)
+        dict_first = feat_extract_first.to_dict()
+        dict_second = feat_extract_second.to_dict()
+        mel_1 = feat_extract_first.mel_filters
+        mel_2 = feat_extract_second.mel_filters
+        self.assertTrue(np.allclose(mel_1, mel_2))
+        self.assertEqual(dict_first, dict_second)
+    def test_call(self):
+        # Tests that list and numpy inputs, batched or not, yield matching features
+        feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
+        # create three inputs of length 800, 1000, and 1200
+        speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
+        np_speech_inputs = [np.asarray(speech_input) for speech_input in speech_inputs]
+        # Test feature size
+        input_features = feature_extractor(np_speech_inputs, padding="max_length", return_tensors="np").input_features
+        self.assertTrue(input_features.ndim == 3)
+        self.assertTrue(input_features.shape[-2] == feature_extractor.feature_size)
+        # Test not batched input
+        encoded_sequences_1 = feature_extractor(speech_inputs[0], return_tensors="np").input_features
+        encoded_sequences_2 = feature_extractor(np_speech_inputs[0], return_tensors="np").input_features
+        self.assertTrue(np.allclose(encoded_sequences_1, encoded_sequences_2, atol=1e-3))
+        # Test batched
+        encoded_sequences_1 = feature_extractor(speech_inputs, return_tensors="np").input_features
+        encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
+        for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
+            self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
+        # Test 2-D numpy arrays are batched.
+        speech_inputs = [floats_list((1, x))[0] for x in (800, 800, 800)]
+        np_speech_inputs = np.asarray(speech_inputs)
+        encoded_sequences_1 = feature_extractor(speech_inputs, return_tensors="np").input_features
+        encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
+        for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
+            self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
+        # Test truncation required
+        speech_inputs = [floats_list((1, x))[0] for x in range(200, (feature_extractor.n_samples + 500), 200)]
+        np_speech_inputs = [np.asarray(speech_input) for speech_input in speech_inputs]
+        speech_inputs_truncated = [x[: feature_extractor.n_samples] for x in speech_inputs]
+        np_speech_inputs_truncated = [np.asarray(speech_input) for speech_input in speech_inputs_truncated]
+        encoded_sequences_1 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
+        encoded_sequences_2 = feature_extractor(np_speech_inputs_truncated, return_tensors="np").input_features
+        for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
+            self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
+    # Copied from transformers.tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTest.test_double_precision_pad
+    def test_double_precision_pad(self):
+        import torch
+        feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
+        np_speech_inputs = np.random.rand(100, 32).astype(np.float64)
+        py_speech_inputs = np_speech_inputs.tolist()
+        for inputs in [py_speech_inputs, np_speech_inputs]:
+            np_processed = feature_extractor.pad([{"input_features": inputs}], return_tensors="np")
+            self.assertTrue(np_processed.input_features.dtype == np.float32)
+            pt_processed = feature_extractor.pad([{"input_features": inputs}], return_tensors="pt")
+            self.assertTrue(pt_processed.input_features.dtype == torch.float32)
+    def _load_datasamples(self, num_samples):
+        ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+        ds = ds.cast_column("audio", Audio(sampling_rate=22050))
+        # automatic decoding with librispeech
+        speech_samples = ds.sort("id").select(range(num_samples))[:num_samples]["audio"]
+        return [x["array"] for x in speech_samples], [x["sampling_rate"] for x in speech_samples]
+    @slow
+    def test_integration(self):
+        # fmt: off
+        EXPECTED_INPUT_FEATURES = torch.tensor(
+            [
+                0.9271, 1.1405, 1.4419, 1.2470, 1.2438, 1.1787, 1.0595, 1.0570, 1.1070,
+                1.2205, 1.2376, 1.2997, 1.1131, 1.0843, 1.0459, 1.1858, 1.2323, 1.3582,
+                1.3401, 1.3770, 1.4173, 1.3381, 1.2291, 1.0854, 1.2116, 1.1873, 1.2178,
+                1.2137, 1.3001, 1.4274
+            ]
+        )
+        # fmt: on
+        input_speech, sr = self._load_datasamples(1)
+        feature_extractor = ClvpFeatureExtractor.from_pretrained("susnato/clvp_dev")
+        input_features = feature_extractor(input_speech, sampling_rate=sr[0], return_tensors="pt").input_features
+        self.assertEqual(input_features.shape, (1, 80, 517))
+        self.assertTrue(torch.allclose(input_features[0, 0, :30], EXPECTED_INPUT_FEATURES, atol=1e-4))
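The way `ClvpFeatureExtractionTester.prepare_inputs_for_common` builds inputs of increasing length can be checked without the library. The sketch below reuses the tester's default values from the diff above. Passing an explicitly seeded `rng` is an assumption added here for reproducibility, whereas the test file uses an unseeded module-level `global_rng`.

```python
import random


def floats_list(shape, scale=1.0, rng=None):
    # Nested list of shape[0] rows, each holding shape[1] random floats in [0, scale).
    if rng is None:
        rng = random.Random()
    return [[rng.random() * scale for _ in range(shape[1])] for _ in range(shape[0])]


# Defaults taken from ClvpFeatureExtractionTester.__init__
batch_size, min_seq_length, max_seq_length, feature_size = 7, 400, 2000, 10
seq_length_diff = (max_seq_length - min_seq_length) // (batch_size - 1)

# One sequence length per batch item, strictly increasing
lengths = list(range(min_seq_length, max_seq_length, seq_length_diff))

rng = random.Random(0)  # seeded here only so the sketch is repeatable
speech_inputs = [floats_list((n, feature_size), rng=rng) for n in lengths]
print(lengths)
```

The integer division gives a step of 266, so the range yields exactly `batch_size` lengths and stops just short of `max_seq_length`.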
--- a/tests/models/clvp/test_modeling_clvp.py
+++ b/tests/models/clvp/test_modeling_clvp.py
--- a/tests/models/clvp/test_processor_clvp.py
+++ b/tests/models/clvp/test_processor_clvp.py
--- a/tests/models/clvp/test_tokenization_clvp.py
+++ b/tests/models/clvp/test_tokenization_clvp.py
--- a/utils/check_repo.py
+++ b/utils/check_repo.py
@@ -207,6 +207,8 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
    "CLIPTextModelWithProjection",
    "CLIPVisionModel",
    "CLIPVisionModelWithProjection",
+    "ClvpForCausalLM",
+    "ClvpModel",
    "GroupViTTextModel",
    "GroupViTVisionModel",
    "TFCLIPTextModel",