"vscode:/vscode.git/clone" did not exist on "857fd8eaabdbcd587782ba11a97a0aface16c09a"
Unverified Commit c236a621 authored by Arthur's avatar Arthur Committed by GitHub
Browse files

[CLAP] Add CLAP to the library (#21370)



* add model like clip

* update

* text model ok

* clap text works

* some refactor

- `CLAPVision` to `CLAPAudio`
- refactor kwargs of audio modules

* more refactor

* more refactor

* more refactor

* correct fusion

* more refactor

* new modules

* add basic processor

* fixup

* remove whisper copioed from

* audio logits match

* add doc

* correct filters mel and add maxlength

* style

* few fixes

* forward passes

* fixup

* fixup

* some clean up

* remove mels form the dictionnary

* pad after the repeat

* update padding when dsmaller

* fix padding

* style

* use swin patch merging

* use copied from swin

* processor with any tokenizer

* more copied from

* some clean up

* more refactor

* fix mel when rand_trunc

* style

* remove unused imports

* update processing

* remove image processing tests

* add testing fiel

* fixmodeling issues

* replace with `is_longer`

* clap in serialization

* more refactor

* `make fixup`

* make fixup

* fix feature extractor

* update test feature extractor

* `make fixup`

* clean up config

* more clean up

* more cleanup

* update tests

* refactor tests and inits

* removeCLAP vision config

* remove CLAP from image procssing auto and dummy vision objects

* update inits

* style

* re order classes in modeling clap

* Use roberta tokenizer as the other weights are not open sourced

* small cleaup

* remove tokenization CLAP

* processor tokenizr is roberta

* update feature extraction doc

* remove vclap from model zero shot

* update f_min and f_max to frequency_xx

* some changes

- fix modeling keys
- add `is_longer` in the forward pass
- make fixup

* make fixup

* consistent behavior ebtween rand_crop and fusion

* add numpy resize and bilinear and documentation

* move resizing to image utils

* clean feature extraction

* import resize from correct file

* resize in image transforms

* update

* style

* style

* nit

* remove unused arguments form the feature extractor

* style

* few fixes + make fixup

* oops

* fix more tests

* add zero shot audio classification pipeline

* update zeroshot classification pipeline

* fixup

* fix copies

* all CI tests pass

* make fixup + fix docs

* fix docs

* fix docs

* update tests pip;eline

* update zero shot pipeline

* update feature extraction clap

* update tokenization auto

* use nested simplify

* update pipeline tests

* Apply suggestions from code review
Co-authored-by: default avatarArthur <48595927+ArthurZucker@users.noreply.github.com>

* split in two lines

* fixes

* refactor

* clean up

* add integration tests

* update config docstring

* style

* update processor

* fix processor test

* fix feat extractor tests

* update docs

* Apply suggestions from code review
Co-authored-by: default avatarArthur <48595927+ArthurZucker@users.noreply.github.com>

* fix readmes

* fix tips

* Update src/transformers/models/auto/configuration_auto.py

* update doc and remove todo -> properly explained

* fix idx and typo

* typoe

* cleanup config

* cleanup tests, styles and doc

* ignore docstyle on image transform

* add conversion script

* remove the `clap` indx in favor of `CLAP`

* update __init

* nits

* Update src/transformers/pipelines/__init__.py

* fix bug

* clarifiy config

* fix copy

* fix init

* Apply suggestions from code review
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix model output

* fix comment

* make fixup

* make fixup

* rename to `Clap`

* replace to `Clap`

* replace to `Clap`

* repo consistency

* again repo-consistency

* make fixup

* Apply suggestions from code review
Co-authored-by: default avatarSanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* add config

* changes

* update conversion

* Apply suggestions from code review
Co-authored-by: default avatarSanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* remove unused function

* update based on code reviews

* style

* more comments

* cleanup

* clean up

* style

* apply suggestions

* Empty commit

* pipeline will be added in a different PR

* update calls to audio utils functions

* update pipeline init

* style

* style

* styling again

* use pad

* fix repo-consistency

* update utils and add doc for audio utils

* clean up resize by using torch. update inits accordingly

* style

* CLap's  tokenizer is RobertA

* add audio utils to internal toctreee

* update totctree

* style

* update documentation and normalize naming accross audio utils and feature extraction clap

* style

* clean up

* update doc and typos

* fix doctest

* update modelin code, got rid of a lot of reshaping

* style on added doc audio utils

* update modeling clap

* style

* Apply suggestions from code review
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>

* docstringvariables with CLAP

* rename key

* update modeling CLAP

* update audio utils docstring

* update processing clap

* fix readmes

* fix toctree

* udpate configuration clap

* fix init

* make fixup

* fix

* fix

* update naming

* update

* update checkpoint path

* Apply suggestions from code review

* Major refactoring

* Update src/transformers/models/clap/configuration_clap.py

* merge

---------
Co-authored-by: default avataryounesbelkada <younesbelkada@gmail.com>
Co-authored-by: default avatarYounes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: default avatarSanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
parent 6b0257de
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available
_import_structure = {
"configuration_clap": [
"CLAP_PRETRAINED_MODEL_ARCHIVE_LIST",
"ClapAudioConfig",
"ClapConfig",
"ClapTextConfig",
],
"processing_clap": ["ClapProcessor"],
}
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_clap"] = [
"CLAP_PRETRAINED_MODEL_ARCHIVE_LIST",
"ClapModel",
"ClapPreTrainedModel",
"ClapTextModel",
"ClapTextModelWithProjection",
"ClapAudioModel",
"ClapAudioModelWithProjection",
]
_import_structure["feature_extraction_clap"] = ["ClapFeatureExtractor"]
if TYPE_CHECKING:
from .configuration_clap import (
CLAP_PRETRAINED_MODEL_ARCHIVE_LIST,
ClapAudioConfig,
ClapConfig,
ClapTextConfig,
)
from .processing_clap import ClapProcessor
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .feature_extraction_clap import ClapFeatureExtractor
from .modeling_clap import (
CLAP_PRETRAINED_MODEL_ARCHIVE_LIST,
ClapAudioModel,
ClapAudioModelWithProjection,
ClapModel,
ClapPreTrainedModel,
ClapTextModel,
ClapTextModelWithProjection,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" CLAP model configuration"""
import copy
import os
from typing import Union
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
CLAP_PRETRAINED_MODEL_ARCHIVE_LIST = {
"laion/clap-htsat-fused": "https://huggingface.co/laion/clap-htsat-fused/resolve/main/config.json",
"laion/clap-htsat-unfused": "https://huggingface.co/laion/clap-htsat-unfused/resolve/main/config.json",
}
class ClapTextConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`ClapTextModel`]. It is used to instantiate a CLAP
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the CLAP
[calp-hsat-fused](https://huggingface.co/laion/clap-hsat-fused) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 30522):
Vocabulary size of the CLAP model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed when calling [`ClapTextModel`].
hidden_size (`int`, *optional*, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (`int`, *optional*, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (`int`, *optional*, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (`int`, *optional*, defaults to 3072):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (`str` or `Callable`, *optional*, defaults to `"relu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"relu"`,
`"relu"`, `"silu"` and `"relu_new"` are supported.
hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (`int`, *optional*, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (`int`, *optional*, defaults to 2):
The vocabulary size of the `token_type_ids` passed when calling [`ClapTextModel`].
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
The epsilon used by the layer normalization layers.
position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
[Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
is_decoder (`bool`, *optional*, defaults to `False`):
Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models). Only
relevant if `config.is_decoder=True`.
classifier_dropout (`float`, *optional*):
The dropout ratio for the classification head.
projection_hidden_act (`str`, *optional*, defaults to `"relu"`):
The non-linear activation function (function or string) in the projection layer. If string, `"gelu"`,
`"relu"`, `"silu"` and `"gelu_new"` are supported.
projection_dim (`int`, *optional*, defaults to 512)
Dimension of the projection head of the `ClapTextModelWithProjection`.
Examples:
```python
>>> from transformers import ClapTextConfig, ClapTextModel
>>> # Initializing a CLAP text configuration
>>> configuration = ClapTextConfig()
>>> # Initializing a model (with random weights) from the configuration
>>> model = ClapTextModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "clap_text_model"
def __init__(
self,
vocab_size=50265,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=514,
type_vocab_size=1,
initializer_range=0.02,
initializer_factor=1.0,
layer_norm_eps=1e-12,
projection_dim=512,
pad_token_id=1,
bos_token_id=0,
eos_token_id=2,
position_embedding_type="absolute",
use_cache=True,
classifier_dropout=None,
projection_hidden_act="relu",
**kwargs,
):
super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.hidden_act = hidden_act
self.intermediate_size = intermediate_size
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.type_vocab_size = type_vocab_size
self.initializer_range = initializer_range
self.initializer_factor = initializer_factor
self.layer_norm_eps = layer_norm_eps
self.position_embedding_type = position_embedding_type
self.use_cache = use_cache
self.classifier_dropout = classifier_dropout
self.projection_hidden_act = projection_hidden_act
self.projection_dim = projection_dim
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
# get the text config dict if we are loading from ClapConfig
if config_dict.get("model_type") == "clap":
config_dict = config_dict["text_config"]
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
logger.warning(
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
)
return cls.from_dict(config_dict, **kwargs)
class ClapAudioConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`ClapAudioModel`]. It is used to instantiate a
CLAP audio encoder according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the audio encoder of the CLAP
[laion/clap-htsat-fused](https://huggingface.co/laion/clap-htsat-fused) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
window_size (`int`, *optional*, defaults to 8):
Image size of the spectrogram
num_mel_bins (`int`, *optional*, defaults to 64):
Number of mel features used per frames. Should correspond to the value used in the `ClapProcessor` class.
spec_size (`int`, *optional*, defaults to 256):
Desired input size of the spectrogram that the model supports. It can be different from the output of the
`ClapFeatureExtractor`, in which case the input features will be resized. Corresponds to the `image_size`
of the audio models.
hidden_act (`str`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"silu"` and `"gelu_new"` are supported.
patch_size (`int`, *optional*, defaults to 4):
Patch size for the audio spectrogram
patch_stride (`list`, *optional*, defaults to `[4, 4]`):
Patch stride for the audio spectrogram
num_classes (`int`, *optional*, defaults to 527):
Number of classes used for the head training
hidden_size (`int`, *optional*, defaults to 768):
Hidden size of the output of the audio encoder. Correspond to the dimension of the penultimate layer's
output,which is sent to the projection MLP layer.
projection_dim (`int`, *optional*, defaults to 512):
Hidden size of the projection layer.
depths (`list`, *optional*, defaults to `[2, 2, 6, 2]`):
Depths used for the Swin Layers of the audio model
num_attention_heads (`list`, *optional*, defaults to `[4, 8, 16, 32]`):
Number of attention heads used for the Swin Layers of the audio model
enable_fusion (`bool`, *optional*, defaults to `False`):
Whether or not to enable patch fusion. This is the main contribution of the authors, and should give the
best results.
hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout probabilitiy for all fully connected layers in the encoder.
fusion_type (`[type]`, *optional*):
Fusion type used for the patch fusion.
patch_embed_input_channels (`int`, *optional*, defaults to 1):
Number of channels used for the input spectrogram
flatten_patch_embeds (`bool`, *optional*, defaults to `True`):
Whether or not to flatten the patch embeddings
patch_embeds_hidden_size (`int`, *optional*, defaults to 96):
Hidden size of the patch embeddings. It is used as the number of output channels.
enable_patch_layer_norm (`bool`, *optional*, defaults to `True`):
Whether or not to enable layer normalization for the patch embeddings
drop_path_rate (`float`, *optional*, defaults to 0.0):
Drop path rate for the patch fusion
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
qkv_bias (`bool`, *optional*, defaults to `True`):
Whether or not to add a bias to the query, key, value projections.
mlp_ratio (`float`, *optional*, defaults to 4.0):
Ratio of the mlp hidden dim to embedding dim.
aff_block_r (`int`, *optional*, defaults to 4):
downsize_ratio used in the AudioFF block
num_hidden_layers (`int`, *optional*, defaults to 4):
Number of hidden layers in the Transformer encoder.
projection_hidden_act (`str`, *optional*, defaults to `"relu"`):
The non-linear activation function (function or string) in the projection layer. If string, `"gelu"`,
`"relu"`, `"silu"` and `"gelu_new"` are supported.
layer_norm_eps (`[type]`, *optional*, defaults to `1e-5`):
The epsilon used by the layer normalization layers.
initializer_factor (`float`, *optional*, defaults to 1.0):
A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
testing).
Example:
```python
>>> from transformers import ClapAudioConfig, ClapAudioModel
>>> # Initializing a ClapAudioConfig with laion/clap-htsat-fused style configuration
>>> configuration = ClapAudioConfig()
>>> # Initializing a ClapAudioModel (with random weights) from the laion/clap-htsat-fused style configuration
>>> model = ClapAudioModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "clap_audio_model"
def __init__(
self,
window_size=8,
num_mel_bins=64,
spec_size=256,
hidden_act="gelu",
patch_size=4,
patch_stride=[4, 4],
num_classes=527,
hidden_size=768,
projection_dim=512,
depths=[2, 2, 6, 2],
num_attention_heads=[4, 8, 16, 32],
enable_fusion=False,
hidden_dropout_prob=0.1,
fusion_type=None,
patch_embed_input_channels=1,
flatten_patch_embeds=True,
patch_embeds_hidden_size=96,
enable_patch_layer_norm=True,
drop_path_rate=0.0,
attention_probs_dropout_prob=0.0,
qkv_bias=True,
mlp_ratio=4.0,
aff_block_r=4,
num_hidden_layers=4,
projection_hidden_act="relu",
layer_norm_eps=1e-5,
initializer_factor=1.0,
**kwargs,
):
super().__init__(**kwargs)
self.window_size = window_size
self.num_mel_bins = num_mel_bins
self.spec_size = spec_size
self.patch_size = patch_size
self.patch_stride = patch_stride
self.num_classes = num_classes
self.hidden_size = hidden_size
self.depths = depths
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.window_size = window_size
self.enable_fusion = enable_fusion
self.fusion_type = fusion_type
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.projection_dim = projection_dim
self.flatten_patch_embeds = flatten_patch_embeds
self.patch_embeds_hidden_size = patch_embeds_hidden_size
self.enable_patch_layer_norm = enable_patch_layer_norm
self.drop_path_rate = drop_path_rate
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.qkv_bias = qkv_bias
self.mlp_ratio = mlp_ratio
self.patch_embed_input_channels = patch_embed_input_channels
self.aff_block_r = aff_block_r
self.layer_norm_eps = layer_norm_eps
self.initializer_factor = initializer_factor
self.projection_hidden_act = projection_hidden_act
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
# get the audio config dict if we are loading from ClapConfig
if config_dict.get("model_type") == "clap":
config_dict = config_dict["audio_config"]
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
logger.warning(
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
)
return cls.from_dict(config_dict, **kwargs)
class ClapConfig(PretrainedConfig):
r"""
[`ClapConfig`] is the configuration class to store the configuration of a [`ClapModel`]. It is used to instantiate
a CLAP model according to the specified arguments, defining the text model and audio model configs. Instantiating a
configuration with the defaults will yield a similar configuration to that of the CLAP
[laion/clap-htsat-fused](https://huggingface.co/laion/clap-htsat-fused) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
text_config (`dict`, *optional*):
Dictionary of configuration options used to initialize [`ClapTextConfig`].
audio_config (`dict`, *optional*):
Dictionary of configuration options used to initialize [`ClapAudioConfig`].
projection_dim (`int`, *optional*, defaults to 512):
Dimentionality of text and audio projection layers.
logit_scale_init_value (`float`, *optional*, defaults to 2.6592):
The inital value of the *logit_scale* paramter. Default is used as per the original CLAP implementation.
projection_hidden_act (`str`, *optional*, defaults to `"relu"`):
Activation function for the projection layers.
initializer_factor (`float`, *optional*, defaults to 1.0):
Factor to scale the initialization of the model weights.
kwargs (*optional*):
Dictionary of keyword arguments.
Example:
```python
>>> from transformers import ClapConfig, ClapModel
>>> # Initializing a ClapConfig with laion-ai/base style configuration
>>> configuration = ClapConfig()
>>> # Initializing a ClapModel (with random weights) from the laion-ai/base style configuration
>>> model = ClapModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
>>> # We can also initialize a ClapConfig from a ClapTextConfig and a ClapAudioConfig
>>> from transformers import ClapTextConfig, ClapAudioConfig
>>> # Initializing a ClapText and ClapAudioConfig configuration
>>> config_text = ClapTextConfig()
>>> config_audio = ClapAudioConfig()
>>> config = ClapConfig.from_text_audio_configs(config_text, config_audio)
```"""
model_type = "clap"
is_composition = True
def __init__(
self,
text_config=None,
audio_config=None,
logit_scale_init_value=(1 / 0.07),
projection_dim=512,
projection_hidden_act="relu",
initializer_factor=1.0,
**kwargs,
):
super().__init__(**kwargs)
if text_config is None:
text_config = {}
logger.info("text_config is None. Initializing the ClapTextConfig with default values.")
if audio_config is None:
audio_config = {}
logger.info("audio_config is None. initializing the ClapAudioConfig with default values.")
self.text_config = ClapTextConfig(**text_config)
self.audio_config = ClapAudioConfig(**audio_config)
self.text_config.projection_dim = projection_dim
self.audio_config.projection_dim = projection_dim
self.text_config.projection_hidden_act = projection_hidden_act
self.audio_config.projection_hidden_act = projection_hidden_act
self.projection_dim = projection_dim
self.projection_hidden_act = projection_hidden_act
self.hidden_size = self.text_config.hidden_size
self.logit_scale_init_value = logit_scale_init_value
self.initializer_factor = initializer_factor
self.num_hidden_layers = self.text_config.num_hidden_layers + len(self.audio_config.depths)
@classmethod
def from_text_audio_configs(cls, text_config: ClapTextConfig, audio_config: ClapAudioConfig, **kwargs):
r"""
Instantiate a [`ClapConfig`] (or a derived class) from clap text model configuration and clap audio model
configuration.
Returns:
[`ClapConfig`]: An instance of a configuration object
"""
return cls(text_config=text_config.to_dict(), audio_config=audio_config.to_dict(), **kwargs)
def to_dict(self):
"""
Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
Returns:
`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,
"""
output = copy.deepcopy(self.__dict__)
output["text_config"] = self.text_config.to_dict()
output["audio_config"] = self.audio_config.to_dict()
output["model_type"] = self.__class__.model_type
return output
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import re
import torch
from CLAP import create_model
from transformers import AutoFeatureExtractor, ClapConfig, ClapModel
KEYS_TO_MODIFY_MAPPING = {
"text_branch": "text_model",
"audio_branch": "audio_model.audio_encoder",
"attn": "attention.self",
"self.proj": "output.dense",
"attention.self_mask": "attn_mask",
"mlp.fc1": "intermediate.dense",
"mlp.fc2": "output.dense",
"norm1": "layernorm_before",
"norm2": "layernorm_after",
"bn0": "batch_norm",
}
processor = AutoFeatureExtractor.from_pretrained("laion/clap-htsat-unfused", truncation="rand_trunc")
def init_clap(checkpoint_path, enable_fusion=False):
model, model_cfg = create_model(
"HTSAT-tiny",
"roberta",
checkpoint_path,
precision="fp32",
device="cuda:0" if torch.cuda.is_available() else "cpu",
enable_fusion=enable_fusion,
fusion_type="aff_2d" if enable_fusion else None,
)
return model, model_cfg
def rename_state_dict(state_dict):
model_state_dict = {}
sequential_layers_pattern = r".*sequential.(\d+).*"
text_projection_pattern = r".*_projection.(\d+).*"
for key, value in state_dict.items():
# check if any key needs to be modified
for key_to_modify, new_key in KEYS_TO_MODIFY_MAPPING.items():
if key_to_modify in key:
key = key.replace(key_to_modify, new_key)
if re.match(sequential_layers_pattern, key):
# replace sequential layers with list
sequential_layer = re.match(sequential_layers_pattern, key).group(1)
key = key.replace(f"sequential.{sequential_layer}.", f"layers.{int(sequential_layer)//3}.linear.")
elif re.match(text_projection_pattern, key):
projecton_layer = int(re.match(text_projection_pattern, key).group(1))
# Because in CLAP they use `nn.Sequential`...
transformers_projection_layer = 1 if projecton_layer == 0 else 2
key = key.replace(f"_projection.{projecton_layer}.", f"_projection.linear{transformers_projection_layer}.")
if "audio" and "qkv" in key:
# split qkv into query key and value
mixed_qkv = value
qkv_dim = mixed_qkv.size(0) // 3
query_layer = mixed_qkv[:qkv_dim]
key_layer = mixed_qkv[qkv_dim : qkv_dim * 2]
value_layer = mixed_qkv[qkv_dim * 2 :]
model_state_dict[key.replace("qkv", "query")] = query_layer
model_state_dict[key.replace("qkv", "key")] = key_layer
model_state_dict[key.replace("qkv", "value")] = value_layer
else:
model_state_dict[key] = value
return model_state_dict
def convert_clap_checkpoint(checkpoint_path, pytorch_dump_folder_path, config_path, enable_fusion=False):
clap_model, clap_model_cfg = init_clap(checkpoint_path, enable_fusion=enable_fusion)
clap_model.eval()
state_dict = clap_model.state_dict()
state_dict = rename_state_dict(state_dict)
transformers_config = ClapConfig()
transformers_config.audio_config.enable_fusion = enable_fusion
model = ClapModel(transformers_config)
# ignore the spectrogram embedding layer
model.load_state_dict(state_dict, strict=False)
model.save_pretrained(pytorch_dump_folder_path)
transformers_config.save_pretrained(pytorch_dump_folder_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
parser.add_argument("--checkpoint_path", default=None, type=str, help="Path to fairseq checkpoint")
parser.add_argument("--config_path", default=None, type=str, help="Path to hf config.json of model to convert")
parser.add_argument("--enable_fusion", action="store_true", help="Whether to enable fusion or not")
args = parser.parse_args()
convert_clap_checkpoint(args.checkpoint_path, args.pytorch_dump_folder_path, args.config_path, args.enable_fusion)
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Feature extractor class for CLAP."""
import copy
from typing import Any, Dict, List, Optional, Union
import numpy as np
import torch
from ...audio_utils import fram_wave, get_mel_filter_banks, power_to_db, stft
from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
from ...feature_extraction_utils import BatchFeature
from ...utils import TensorType, logging
logger = logging.get_logger(__name__)
class ClapFeatureExtractor(SequenceFeatureExtractor):
r"""
Constructs a CLAP feature extractor.
This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
most of the main methods. Users should refer to this superclass for more information regarding those methods.
This class extracts mel-filter bank features from raw speech using a custom numpy implementation of the *Short Time
Fourier Transform* (STFT) which should match pytorch's `torch.stft` equivalent.
Args:
feature_size (`int`, defaults to 64):
The feature dimension of the extracted Mel spectrograms. This corresponds to the number of mel filters
(`n_mels`).
sampling_rate (`int`, defaults to 48_000):
The sampling rate at which the audio files should be digitalized expressed in hertz (Hz). This only serves
to warn users if the audio fed to the feature extractor does not have the same sampling rate.
hop_length (`int`, defaults to 480):
Length of the overlaping windows for the STFT used to obtain the Mel Spectrogram. The audio will be split
in smaller `frames` with a step of `hop_length` between each frame.
max_length_s (`int`, defaults to 10):
The maximum input lenght of the model in seconds. This is used to pad the audio.
fft_window_size (`int`, defaults to 1024):
Size of the window (in samples) on which the Fourier transform is applied. This controls the frequency
resolution of the spectrogram. 400 means that the fourrier transform is computed on windows of 400 samples.
padding_value (`float`, *optional*, defaults to 0.0):
Padding value used to pad the audio. Should correspond to silences.
return_attention_mask (`bool`, *optional*, defaults to `False`):
Whether or not the model should return the attention masks coresponding to the input.
frequency_min (`float`, *optional*, default to 0):
The lowest frequency of interest. The STFT will not be computed for values below this.
frequency_max (`float`, *optional*, default to 14_000):
The highest frequency of interest. The STFT will not be computed for values above this.
top_db (`float`, *optional*):
The highest decibel value used to convert the mel spectrogram to the log scale. For more details see the
`audio_utils.power_to_db` function
truncation (`str`, *optional*, default to `"fusions"`):
Truncation pattern for long audio inputs. Two patterns are available:
- `fusion` will use `_random_mel_fusion`, which stacks 3 random crops from the mel spectrogram and a
downsampled version of the entire mel spectrogram.
If `config.fusion` is set to True, shorter audios also need to to return 4 mels, which will just be a copy
of the original mel obtained from the padded audio.
- `rand_trunc` will select a random crop of the mel spectrogram.
padding (`str`, *optional*, defaults to `"repeatpad"`):
Padding pattern for shorter audio inputs. Three patterns were originally implemented:
- `repeatpad`: the audio is repeated, and then padded to fit the `max_length`.
- `repeat`: the audio is repeated and then cut to fit the `max_length`
- `pad`: the audio is padded.
"""
model_input_names = ["input_features", "is_longer"]
def __init__(
self,
feature_size=64,
sampling_rate=48_000,
hop_length=480,
max_length_s=10,
fft_window_size=1024,
padding_value=0.0,
return_attention_mask=False, # pad inputs to max length with silence token (zero) and no attention mask
frequency_min: float = 0,
frequency_max: float = 14_000,
top_db: int = None,
truncation: str = "fusion",
padding: str = "repeatpad",
**kwargs,
):
super().__init__(
feature_size=feature_size,
sampling_rate=sampling_rate,
padding_value=padding_value,
return_attention_mask=return_attention_mask,
**kwargs,
)
self.top_db = top_db
self.truncation = truncation
self.padding = padding
self.fft_window_size = fft_window_size
self.nb_frequency_bins = (fft_window_size >> 1) + 1
self.hop_length = hop_length
self.max_length_s = max_length_s
self.nb_max_samples = max_length_s * sampling_rate
self.sampling_rate = sampling_rate
self.frequency_min = frequency_min
self.frequency_max = frequency_max
self.mel_filters = get_mel_filter_banks(
nb_frequency_bins=self.nb_frequency_bins,
nb_mel_filters=feature_size,
frequency_min=frequency_min,
frequency_max=frequency_max,
sample_rate=sampling_rate,
norm=None,
mel_scale="htk",
)
self.mel_filters_slaney = get_mel_filter_banks(
nb_frequency_bins=self.nb_frequency_bins,
nb_mel_filters=feature_size,
frequency_min=frequency_min,
frequency_max=frequency_max,
sample_rate=sampling_rate,
norm="slaney",
mel_scale="slaney",
)
def to_dict(self) -> Dict[str, Any]:
"""
Serializes this instance to a Python dictionary.
Returns:
`Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance, excpet for the
mel filter banks, which do not need to be saved or printed as they are too long.
"""
output = copy.deepcopy(self.__dict__)
output["feature_extractor_type"] = self.__class__.__name__
if "mel_filters" in output:
del output["mel_filters"]
if "mel_filters_slaney" in output:
del output["mel_filters_slaney"]
return output
def _np_extract_fbank_features(self, waveform: np.array, mel_filters: Optional[np.array] = None) -> np.ndarray:
"""
Compute the log-Mel spectrogram of the provided `waveform` using the `hanning` window. In CLAP, two different
filter banks are used depending on the truncation pattern:
- `self.mel_filters`: they correspond to the defaults parameters of `torchaduio` which can be obtained from
calling `torchaudio.transforms.MelSpectrogram().mel_scale.fb`. These filters are used when `truncation`
is set to `"fusion"`.
- `self.mel_filteres_slaney` : they correspond to the defaults parameters of `torchlibrosa` which used
`librosa.filters.mel` when computing the mel spectrogram. These filters were only used in the original
implementation when the truncation mode is not `"fusion"`.
"""
window = np.hanning(self.fft_window_size + 1)[:-1]
frames = fram_wave(waveform, self.hop_length, self.fft_window_size)
spectrogram = stft(frames, window, fft_window_size=self.fft_window_size)
magnitudes = np.abs(spectrogram) ** 2
mel_spectrogram = np.matmul(mel_filters.T, magnitudes)
log_mel_spectrogram = power_to_db(mel_spectrogram).T
log_mel_spectrogram = np.asarray(log_mel_spectrogram, np.float32)
return log_mel_spectrogram
def _random_mel_fusion(self, mel, total_frames, chunk_frames):
ranges = np.array_split(list(range(0, total_frames - chunk_frames + 1)), 3)
if len(ranges[1]) == 0:
# if the audio is too short, we just use the first chunk
ranges[1] = [0]
if len(ranges[2]) == 0:
# if the audio is too short, we just use the first chunk
ranges[2] = [0]
# randomly choose index for each part
idx_front = np.random.choice(ranges[0])
idx_middle = np.random.choice(ranges[1])
idx_back = np.random.choice(ranges[2])
mel_chunk_front = mel[idx_front : idx_front + chunk_frames, :]
mel_chunk_middle = mel[idx_middle : idx_middle + chunk_frames, :]
mel_chunk_back = mel[idx_back : idx_back + chunk_frames, :]
mel = torch.tensor(mel[None, None, :])
mel_shrink = torch.nn.functional.interpolate(
mel, size=[chunk_frames, 64], mode="bilinear", align_corners=False, antialias=False
)
mel_shrink = mel_shrink[0][0].numpy()
mel_fusion = np.stack([mel_chunk_front, mel_chunk_middle, mel_chunk_back, mel_shrink], axis=0)
return mel_fusion
def _get_input_mel(self, waveform: np.array, max_length, truncation, padding) -> np.array:
"""
Extracts the mel spectrogram and prepares it for the mode based on the `truncation` and `padding` arguments.
Four different path are possible:
- `truncation="fusion"` and the length of the waveform is greater than the max length: the mel spectrogram
will be computed on the entire audio. 3 random crops and a dowsampled version of the full mel spectrogram
are then stacked together. They will later be used for `feature_fusion`.
- `truncation="rand_trunc"` and the length of the waveform is smaller than the max length: the audio is
padded based on `padding`.
- `truncation="fusion"` and the length of the waveform is smaller than the max length: the audio is padded
based on `padding`, and is repeated `4` times.
- `truncation="rand_trunc"` and the length of the waveform is greater than the max length: the mel
spectrogram will be computed on a random crop of the waveform.
"""
if waveform.shape[0] > max_length:
if truncation == "rand_trunc":
longer = True
# random crop to max_length (for compatibility) -> this should be handled by self.pad
overflow = len(waveform) - max_length
idx = np.random.randint(0, overflow + 1)
waveform = waveform[idx : idx + max_length]
input_mel = self._np_extract_fbank_features(waveform, self.mel_filters_slaney)[None, :]
elif truncation == "fusion":
mel = self._np_extract_fbank_features(waveform, self.mel_filters)
chunk_frames = max_length // self.hop_length + 1 # the +1 related to how the spectrogram is computed
total_frames = mel.shape[0]
if chunk_frames == total_frames:
# there is a corner case where the audio length is larger than max_length but smaller than max_length+hop_length.
# In this case, we just use the whole audio.
input_mel = np.stack([mel, mel, mel, mel], axis=0)
longer = False
else:
input_mel = self._random_mel_fusion(mel, total_frames, chunk_frames)
longer = True
else:
raise NotImplementedError(f"data_truncating {truncation} not implemented")
else:
longer = False
# only use repeat as a new possible value for padding. you repeat the audio before applying the usual max_length padding
if waveform.shape[0] < max_length:
if padding == "repeat":
n_repeat = int(max_length / len(waveform))
waveform = np.stack(np.tile(waveform, n_repeat + 1))[:max_length]
if padding == "repeatpad":
n_repeat = int(max_length / len(waveform))
waveform = np.stack(np.tile(waveform, n_repeat))
waveform = np.pad(waveform, (0, max_length - waveform.shape[0]), mode="constant", constant_values=0)
if truncation == "fusion":
input_mel = self._np_extract_fbank_features(waveform, self.mel_filters)
input_mel = np.stack([input_mel, input_mel, input_mel, input_mel], axis=0)
else:
input_mel = self._np_extract_fbank_features(waveform, self.mel_filters_slaney)[None, :]
return input_mel, longer
def __call__(
self,
raw_speech: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]]],
truncation: str = None,
padding: Optional[str] = None,
max_length: Optional[int] = None,
sampling_rate: Optional[int] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
**kwargs,
) -> BatchFeature:
"""
Main method to featurize and prepare for the model one or several sequence(s).
Args:
raw_speech (`np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[List[float]]`):
The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float
values, a list of numpy arrays or a list of list of float values.
truncation (`str`, *optional*):
Truncation pattern for long audio inputs. Two patterns are available:
- `fusion` will use `_random_mel_fusion`, which stacks 3 random crops from the mel spectrogram and
a downsampled version of the entire mel spectrogram.
If `config.fusion` is set to True, shorter audios also need to to return 4 mels, which will just be a
copy of the original mel obtained from the padded audio.
- `rand_trunc` will select a random crop of the mel spectrogram.
padding (`str`, *optional*):
Padding pattern for shorter audio inputs. Three patterns were originally implemented:
- `repeatpad`: the audio is repeated, and then padded to fit the `max_length`.
- `repeat`: the audio is repeated and then cut to fit the `max_length`
- `pad`: the audio is padded.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors instead of list of python integers. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.np.array` objects.
- `'np'`: Return Numpy `np.ndarray` objects.
sampling_rate (`int`, *optional*):
The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
`sampling_rate` at the forward call to prevent silent errors and allow automatic speech recognition
pipeline.
"""
truncation = truncation if truncation is not None else self.truncation
padding = padding if padding else self.padding
if sampling_rate is not None:
if sampling_rate != self.sampling_rate:
raise ValueError(
f"The model corresponding to this feature extractor: {self.__class__.__name__} was trained using a"
f" sampling rate of {self.sampling_rate}. Please make sure that the provided `raw_speech` input"
f" was sampled with {self.sampling_rate} and not {sampling_rate}."
)
else:
logger.warning(
"It is strongly recommended to pass the `sampling_rate` argument to this function. "
"Failing to do so can result in silent errors that might be hard to debug."
)
is_batched = bool(
isinstance(raw_speech, (list, tuple))
and (isinstance(raw_speech[0], np.ndarray) or isinstance(raw_speech[0], (tuple, list)))
)
if is_batched:
raw_speech = [np.asarray(speech, dtype=np.float64) for speech in raw_speech]
elif not is_batched and not isinstance(raw_speech, np.ndarray):
raw_speech = np.asarray(raw_speech, dtype=np.float64)
elif isinstance(raw_speech, np.ndarray) and raw_speech.dtype is np.dtype(np.float64):
raw_speech = raw_speech.astype(np.float64)
# always return batch
if not is_batched:
raw_speech = [np.asarray(raw_speech)]
# convert to mel spectrogram, truncate and pad if needed.
padded_inputs = [
self._get_input_mel(waveform, max_length if max_length else self.nb_max_samples, truncation, padding)
for waveform in raw_speech
]
input_mel = []
is_longer = []
for mel, longer in padded_inputs:
input_mel.append(mel)
is_longer.append(longer)
if truncation == "fusion" and sum(is_longer) == 0:
# if no audio is longer than 10s, then randomly select one audio to be longer
rand_idx = np.random.randint(0, len(input_mel))
is_longer[rand_idx] = True
if isinstance(input_mel[0], List):
input_mel = [np.asarray(feature, dtype=np.float64) for feature in input_mel]
input_features = {"input_features": input_mel, "is_longer": is_longer}
input_features = BatchFeature(input_features)
if return_tensors is not None:
input_features = input_features.convert_to_tensors(return_tensors)
return input_features
# coding=utf-8
# Copyright 2023 The LAION-AI Team and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch CLAP model."""
import collections
import math
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple, Union
import numpy as np
import torch
import torch.nn.functional as F
from torch import nn
from ...activations import ACT2FN
from ...modeling_outputs import (
BaseModelOutputWithPastAndCrossAttentions,
BaseModelOutputWithPooling,
BaseModelOutputWithPoolingAndCrossAttentions,
)
from ...modeling_utils import PreTrainedModel
from ...pytorch_utils import apply_chunking_to_forward, find_pruneable_heads_and_indices, meshgrid, prune_linear_layer
from ...utils import (
ModelOutput,
add_start_docstrings,
add_start_docstrings_to_model_forward,
logging,
replace_return_docstrings,
)
from .configuration_clap import ClapAudioConfig, ClapConfig, ClapTextConfig
logger = logging.get_logger(__name__)
_CHECKPOINT_FOR_DOC = "laion/clap-htsat-fused"
CLAP_PRETRAINED_MODEL_ARCHIVE_LIST = [
"laion/clap-htsat-fused",
"laion/clap-htsat-unfused",
# See all clap models at https://huggingface.co/models?filter=clap
]
# Adapted from: https://github.com/LAION-AI/CLAP/blob/6ad05a971ba0622f6acee8c41993e0d02bbed639/src/open_clip/utils.py#L191
def interpolate(hidden_states, ratio):
"""
Interpolate data in time domain. This is used to compensate the resolution reduction in downsampling of a CNN.
Args:
hidden_states (`torch.FloatTensor` of shape (batch_size, time_length, classes_num)):
Input hidden states
ratio (`int`):
The ratio of the length of the output to the length of the input.
"""
(batch_size, time_length, classes_num) = hidden_states.shape
upsampled = hidden_states[:, :, None, :].repeat(1, 1, ratio, 1)
upsampled = upsampled.reshape(batch_size, time_length * ratio, classes_num)
return upsampled
# Adapted from https://github.com/LAION-AI/CLAP/blob/6ad05a971ba0622f6acee8c41993e0d02bbed639/src/open_clip/htsat.py#L249
def window_partition(hidden_states, window_size):
"""
Returns the resized hidden states. The output shape should be `(batch_size * num_windows, window_size, window_size,
num_channels)`
Args:
hidden_states (`torch.FloatTensor` of shape `(batch_size, height, width, num_channels)`):
Input hidden states
window_size (`int`):
Window size
"""
batch_size, height, width, num_channels = hidden_states.shape
hidden_states = hidden_states.view(
batch_size, height // window_size, window_size, width // window_size, window_size, num_channels
)
windows = hidden_states.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, num_channels)
return windows
# Adapted from https://github.com/LAION-AI/CLAP/blob/6ad05a971ba0622f6acee8c41993e0d02bbed639/src/open_clip/htsat.py#L263
def window_reverse(windows, window_size, height, width):
"""
Args:
windows (`torch.FloatTensor` of shape `(num_windows * batch_size, window_size, window_size, num_channels)`):
Input windows
window_size (`int`):
Window size
height (`int`):
Height of the resized audio
width (`int`):
Width of the resized audio
"""
batch_size = int(windows.shape[0] / (height * width / window_size / window_size))
hidden_states = windows.view(batch_size, height // window_size, width // window_size, window_size, window_size, -1)
hidden_states = hidden_states.permute(0, 1, 3, 2, 4, 5).contiguous().view(batch_size, height, width, -1)
return hidden_states
# Copied from transformers.models.roberta.modeling_roberta.create_position_ids_from_input_ids
def create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0):
"""
Replace non-padding symbols with their position numbers. Position numbers begin at padding_idx+1. Padding symbols
are ignored. This is modified from fairseq's `utils.make_positions`.
Args:
x: torch.Tensor x:
Returns: torch.Tensor
"""
# The series of casts and type-conversions here are carefully balanced to both work with ONNX export and XLA.
mask = input_ids.ne(padding_idx).int()
incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask
return incremental_indices.long() + padding_idx
# contrastive loss function, adapted from
# https://sachinruk.github.io/blog/pytorch/pytorch%20lightning/loss%20function/gpu/2021/03/07/CLIP.html#CLIP-loss-function
def contrastive_loss(logits: torch.Tensor) -> torch.Tensor:
labels = torch.arange(len(logits), device=logits.device)
return nn.functional.cross_entropy(logits, labels)
@dataclass
# Copied from transformers.models.clip.modeling_clip.CLIPTextModelOutput with CLIP->Clap
class ClapTextModelOutput(ModelOutput):
"""
Base class for text model's outputs that also contains a pooling of the last hidden states.
Args:
text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
The text embeddings obtained by applying the projection layer to the pooler_output.
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
"""
text_embeds: Optional[torch.FloatTensor] = None
last_hidden_state: torch.FloatTensor = None
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
attentions: Optional[Tuple[torch.FloatTensor]] = None
@dataclass
class ClapAudioModelOutput(ModelOutput):
"""
ClapAudio model output to mimic the output of the original implementation.
Args:
audio_embeds (`torch.FloatTensor` of shape `(batch_size, hidden_size)`):
The Audio embeddings obtained by applying the projection layer to the pooler_output.
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
"""
audio_embeds: Optional[torch.FloatTensor] = None
last_hidden_state: torch.FloatTensor = None
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
attentions: Optional[Tuple[torch.FloatTensor]] = None
@dataclass
# Copied from transformers.models.clip.modeling_clip.CLIPOutput with CLIP->Clap, vision->audio, Vision->Audio, image->audio
class ClapOutput(ModelOutput):
"""
Args:
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`):
Contrastive loss for audio-text similarity.
logits_per_audio:(`torch.FloatTensor` of shape `(audio_batch_size, text_batch_size)`):
The scaled dot product scores between `audio_embeds` and `text_embeds`. This represents the audio-text
similarity scores.
logits_per_text:(`torch.FloatTensor` of shape `(text_batch_size, audio_batch_size)`):
The scaled dot product scores between `text_embeds` and `audio_embeds`. This represents the text-audio
similarity scores.
text_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`):
The text embeddings obtained by applying the projection layer to the pooled output of [`ClapTextModel`].
audio_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`):
The audio embeddings obtained by applying the projection layer to the pooled output of [`ClapAudioModel`].
text_model_output(`BaseModelOutputWithPooling`):
The output of the [`ClapTextModel`].
audio_model_output(`BaseModelOutputWithPooling`):
The output of the [`ClapAudioModel`].
"""
loss: Optional[torch.FloatTensor] = None
logits_per_audio: torch.FloatTensor = None
logits_per_text: torch.FloatTensor = None
text_embeds: torch.FloatTensor = None
audio_embeds: torch.FloatTensor = None
text_model_output: BaseModelOutputWithPooling = None
audio_model_output: BaseModelOutputWithPooling = None
def to_tuple(self) -> Tuple[Any]:
return tuple(
self[k] if k not in ["text_model_output", "audio_model_output"] else getattr(self, k).to_tuple()
for k in self.keys()
)
# Adapted from transformers.models.swin.modeling_swin.SwinDropPath
class ClapDropPath(nn.Module):
"""
Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks). This is a slightly
refactored version of the `SwinDropPath` implementation.
"""
def __init__(self, drop_prob=None):
super().__init__()
self.drop_prob = drop_prob
def forward(self, hidden_states):
if self.drop_prob == 0.0 or not self.training:
return hidden_states
keep_prob = 1 - self.drop_prob
# work with diff dim tensors, not just 2D ConvNets
shape = (hidden_states.shape[0],) + (1,) * (hidden_states.ndim - 1)
random_tensor = keep_prob + torch.rand(shape, dtype=hidden_states.dtype, device=hidden_states.device)
random_tensor.floor_() # binarize
output = hidden_states.div(keep_prob) * random_tensor
return output
# Adapted from https://github.com/LAION-AI/CLAP/blob/6ad05a971ba0622f6acee8c41993e0d02bbed639/src/open_clip/feature_fusion.py#L133
class ClapAudioAFFBlock(nn.Module):
r"""
ATTENTIONAL FEATURE FUSION Block from CLAP, since in CLAP we are always in 2D mode, it is not needed to implement
the 1D version.
"""
def __init__(self, config: ClapAudioConfig):
super().__init__()
channels = config.patch_embeds_hidden_size
downsize_ratio = config.aff_block_r
inter_channels = int(channels // downsize_ratio)
self.local_att = nn.Sequential(
nn.Conv2d(channels, inter_channels, kernel_size=1, stride=1, padding=0),
nn.BatchNorm2d(inter_channels),
nn.ReLU(inplace=True),
nn.Conv2d(inter_channels, channels, kernel_size=1, stride=1, padding=0),
nn.BatchNorm2d(channels),
)
self.global_att = nn.Sequential(
nn.AdaptiveAvgPool2d(1),
nn.Conv2d(channels, inter_channels, kernel_size=1, stride=1, padding=0),
nn.BatchNorm2d(inter_channels),
nn.ReLU(inplace=True),
nn.Conv2d(inter_channels, channels, kernel_size=1, stride=1, padding=0),
nn.BatchNorm2d(channels),
)
self.sigmoid = nn.Sigmoid()
def forward(self, hidden_states, residual):
attention_input = hidden_states + residual
fused_layer_output = self.local_att(attention_input) + self.global_att(attention_input)
fused_layer_output = self.sigmoid(fused_layer_output)
output = 2 * hidden_states * fused_layer_output + 2 * residual * (1 - fused_layer_output)
return output
class ClapAudioPatchEmbed(nn.Module):
"""
This module converts the hidden states reshaped as an image to patch embeddings ready to be passed to the
Transformer block.
"""
def __init__(self, config: ClapAudioConfig):
super().__init__()
img_size = (config.spec_size, config.spec_size) if isinstance(config.spec_size, int) else config.spec_size
patch_size = (
(config.patch_size, config.patch_size) if isinstance(config.patch_size, int) else config.patch_size
)
patch_stride = (
(config.patch_stride, config.patch_stride) if isinstance(config.patch_stride, int) else config.patch_stride
)
self.img_size = img_size
self.patch_stride = patch_stride
self.grid_size = (img_size[0] // patch_stride[0], img_size[1] // patch_stride[1])
self.num_patches = self.grid_size[0] * self.grid_size[1]
self.flatten = config.flatten_patch_embeds
self.enable_fusion = config.enable_fusion
padding = ((patch_size[0] - patch_stride[0]) // 2, (patch_size[1] - patch_stride[1]) // 2)
scale_factor = 4 if (self.enable_fusion) and (config.fusion_type == "channel_map") else 1
self.proj = nn.Conv2d(
config.patch_embed_input_channels * scale_factor,
config.patch_embeds_hidden_size,
kernel_size=patch_size,
stride=patch_stride,
padding=padding,
)
self.norm = nn.LayerNorm(config.patch_embeds_hidden_size) if config.enable_patch_layer_norm else nn.Identity()
if self.enable_fusion:
self.fusion_model = ClapAudioAFFBlock(config)
self.mel_conv2d = nn.Conv2d(
config.patch_embed_input_channels,
config.patch_embeds_hidden_size,
kernel_size=(patch_size[0], patch_size[1] * 3),
stride=(patch_stride[0], patch_stride[1] * 3),
padding=padding,
)
def forward(self, hidden_states, is_longer_idx=None):
if self.enable_fusion:
# retrieve the last mel as we have transposed the input
global_hidden_states = hidden_states[:, 0:1, :, :]
# global processing
batch_size, num_channels, height, width = global_hidden_states.shape
if height != self.img_size[0] or width != self.img_size[1]:
raise ValueError(
f"Input audio size ({height}*{width}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."
)
global_hidden_states = self.proj(global_hidden_states)
output_width = global_hidden_states.size(-1)
if len(is_longer_idx) > 0:
# local processing
local_hidden_states = hidden_states[is_longer_idx, 1:, :, :].contiguous()
batch_size, num_channels, height, width = local_hidden_states.shape
local_hidden_states = local_hidden_states.view(batch_size * num_channels, 1, height, width)
local_hidden_states = self.mel_conv2d(local_hidden_states)
_, features, height, width = local_hidden_states.shape
local_hidden_states = local_hidden_states.view(batch_size, num_channels, features, height, width)
local_hidden_states = local_hidden_states.permute((0, 2, 3, 1, 4)).contiguous().flatten(3)
local_width = local_hidden_states.size(-1)
local_hidden_states = torch.nn.functional.pad(
local_hidden_states, (0, output_width - local_width), "constant", 0
)
global_hidden_states[is_longer_idx] = self.fusion_model(
global_hidden_states[is_longer_idx], local_hidden_states
)
hidden_states = global_hidden_states
else:
_, _, height, width = hidden_states.shape
if height != self.img_size[0] or width != self.img_size[1]:
raise ValueError(
f"Input audio size ({height}*{width}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."
)
hidden_states = self.proj(hidden_states)
if self.flatten:
hidden_states = hidden_states.flatten(2).transpose(1, 2)
hidden_states = self.norm(hidden_states)
return hidden_states
# Copied from transformers.models.swin.modeling_swin.SwinSelfAttention with Swin->ClapAudio
class ClapAudioSelfAttention(nn.Module):
def __init__(self, config, dim, num_heads, window_size):
super().__init__()
if dim % num_heads != 0:
raise ValueError(
f"The hidden size ({dim}) is not a multiple of the number of attention heads ({num_heads})"
)
self.num_attention_heads = num_heads
self.attention_head_size = int(dim / num_heads)
self.all_head_size = self.num_attention_heads * self.attention_head_size
self.window_size = (
window_size if isinstance(window_size, collections.abc.Iterable) else (window_size, window_size)
)
self.relative_position_bias_table = nn.Parameter(
torch.zeros((2 * self.window_size[0] - 1) * (2 * self.window_size[1] - 1), num_heads)
)
# get pair-wise relative position index for each token inside the window
coords_h = torch.arange(self.window_size[0])
coords_w = torch.arange(self.window_size[1])
coords = torch.stack(meshgrid([coords_h, coords_w], indexing="ij"))
coords_flatten = torch.flatten(coords, 1)
relative_coords = coords_flatten[:, :, None] - coords_flatten[:, None, :]
relative_coords = relative_coords.permute(1, 2, 0).contiguous()
relative_coords[:, :, 0] += self.window_size[0] - 1
relative_coords[:, :, 1] += self.window_size[1] - 1
relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1
relative_position_index = relative_coords.sum(-1)
self.register_buffer("relative_position_index", relative_position_index)
self.query = nn.Linear(self.all_head_size, self.all_head_size, bias=config.qkv_bias)
self.key = nn.Linear(self.all_head_size, self.all_head_size, bias=config.qkv_bias)
self.value = nn.Linear(self.all_head_size, self.all_head_size, bias=config.qkv_bias)
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
def transpose_for_scores(self, x):
new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
x = x.view(new_x_shape)
return x.permute(0, 2, 1, 3)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.FloatTensor] = None,
head_mask: Optional[torch.FloatTensor] = None,
output_attentions: Optional[bool] = False,
) -> Tuple[torch.Tensor]:
batch_size, dim, num_channels = hidden_states.shape
mixed_query_layer = self.query(hidden_states)
key_layer = self.transpose_for_scores(self.key(hidden_states))
value_layer = self.transpose_for_scores(self.value(hidden_states))
query_layer = self.transpose_for_scores(mixed_query_layer)
# Take the dot product between "query" and "key" to get the raw attention scores.
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
relative_position_bias = self.relative_position_bias_table[self.relative_position_index.view(-1)]
relative_position_bias = relative_position_bias.view(
self.window_size[0] * self.window_size[1], self.window_size[0] * self.window_size[1], -1
)
relative_position_bias = relative_position_bias.permute(2, 0, 1).contiguous()
attention_scores = attention_scores + relative_position_bias.unsqueeze(0)
if attention_mask is not None:
# Apply the attention mask is (precomputed for all layers in ClapAudioModel forward() function)
mask_shape = attention_mask.shape[0]
attention_scores = attention_scores.view(
batch_size // mask_shape, mask_shape, self.num_attention_heads, dim, dim
)
attention_scores = attention_scores + attention_mask.unsqueeze(1).unsqueeze(0)
attention_scores = attention_scores.view(-1, self.num_attention_heads, dim, dim)
# Normalize the attention scores to probabilities.
attention_probs = nn.functional.softmax(attention_scores, dim=-1)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = self.dropout(attention_probs)
# Mask heads if we want to
if head_mask is not None:
attention_probs = attention_probs * head_mask
context_layer = torch.matmul(attention_probs, value_layer)
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
context_layer = context_layer.view(new_context_layer_shape)
outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
return outputs
# Copied from transformers.models.swin.modeling_swin.SwinSelfOutput with Swin->ClapAudio
class ClapAudioSelfOutput(nn.Module):
def __init__(self, config, dim):
super().__init__()
self.dense = nn.Linear(dim, dim)
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
return hidden_states
# Copied from transformers.models.swin.modeling_swin.SwinAttention with Swin->ClapAudio
class ClapAudioAttention(nn.Module):
def __init__(self, config, dim, num_heads, window_size):
super().__init__()
self.self = ClapAudioSelfAttention(config, dim, num_heads, window_size)
self.output = ClapAudioSelfOutput(config, dim)
self.pruned_heads = set()
def prune_heads(self, heads):
if len(heads) == 0:
return
heads, index = find_pruneable_heads_and_indices(
heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads
)
# Prune linear layers
self.self.query = prune_linear_layer(self.self.query, index)
self.self.key = prune_linear_layer(self.self.key, index)
self.self.value = prune_linear_layer(self.self.value, index)
self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
# Update hyper params and store pruned heads
self.self.num_attention_heads = self.self.num_attention_heads - len(heads)
self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads
self.pruned_heads = self.pruned_heads.union(heads)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.FloatTensor] = None,
head_mask: Optional[torch.FloatTensor] = None,
output_attentions: Optional[bool] = False,
) -> Tuple[torch.Tensor]:
self_outputs = self.self(hidden_states, attention_mask, head_mask, output_attentions)
attention_output = self.output(self_outputs[0], hidden_states)
outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them
return outputs
# Copied from transformers.models.swin.modeling_swin.SwinIntermediate with Swin->ClapAudio
class ClapAudioIntermediate(nn.Module):
def __init__(self, config, dim):
super().__init__()
self.dense = nn.Linear(dim, int(config.mlp_ratio * dim))
if isinstance(config.hidden_act, str):
self.intermediate_act_fn = ACT2FN[config.hidden_act]
else:
self.intermediate_act_fn = config.hidden_act
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
hidden_states = self.dense(hidden_states)
hidden_states = self.intermediate_act_fn(hidden_states)
return hidden_states
# Copied from transformers.models.swin.modeling_swin.SwinOutput with Swin->ClapAudio
class ClapAudioOutput(nn.Module):
def __init__(self, config, dim):
super().__init__()
self.dense = nn.Linear(int(config.mlp_ratio * dim), dim)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
return hidden_states
# Copied from transformers.models.swin.modeling_swin.SwinLayer with SwinDropPath->ClapDropPath, Swin->ClapAudio
class ClapAudioLayer(nn.Module):
def __init__(self, config, dim, input_resolution, num_heads, shift_size=0):
super().__init__()
self.chunk_size_feed_forward = config.chunk_size_feed_forward
self.shift_size = shift_size
self.window_size = config.window_size
self.input_resolution = input_resolution
self.layernorm_before = nn.LayerNorm(dim, eps=config.layer_norm_eps)
self.attention = ClapAudioAttention(config, dim, num_heads, window_size=self.window_size)
self.drop_path = ClapDropPath(config.drop_path_rate) if config.drop_path_rate > 0.0 else nn.Identity()
self.layernorm_after = nn.LayerNorm(dim, eps=config.layer_norm_eps)
self.intermediate = ClapAudioIntermediate(config, dim)
self.output = ClapAudioOutput(config, dim)
def set_shift_and_window_size(self, input_resolution):
if min(input_resolution) <= self.window_size:
# if window size is larger than input resolution, we don't partition windows
self.shift_size = 0
self.window_size = min(input_resolution)
def get_attn_mask(self, height, width, dtype):
if self.shift_size > 0:
# calculate attention mask for SW-MSA
img_mask = torch.zeros((1, height, width, 1), dtype=dtype)
height_slices = (
slice(0, -self.window_size),
slice(-self.window_size, -self.shift_size),
slice(-self.shift_size, None),
)
width_slices = (
slice(0, -self.window_size),
slice(-self.window_size, -self.shift_size),
slice(-self.shift_size, None),
)
count = 0
for height_slice in height_slices:
for width_slice in width_slices:
img_mask[:, height_slice, width_slice, :] = count
count += 1
mask_windows = window_partition(img_mask, self.window_size)
mask_windows = mask_windows.view(-1, self.window_size * self.window_size)
attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0)).masked_fill(attn_mask == 0, float(0.0))
else:
attn_mask = None
return attn_mask
def maybe_pad(self, hidden_states, height, width):
pad_right = (self.window_size - width % self.window_size) % self.window_size
pad_bottom = (self.window_size - height % self.window_size) % self.window_size
pad_values = (0, 0, 0, pad_right, 0, pad_bottom)
hidden_states = nn.functional.pad(hidden_states, pad_values)
return hidden_states, pad_values
def forward(
self,
hidden_states: torch.Tensor,
input_dimensions: Tuple[int, int],
head_mask: Optional[torch.FloatTensor] = None,
output_attentions: Optional[bool] = False,
always_partition: Optional[bool] = False,
) -> Tuple[torch.Tensor, torch.Tensor]:
if not always_partition:
self.set_shift_and_window_size(input_dimensions)
else:
pass
height, width = input_dimensions
batch_size, _, channels = hidden_states.size()
shortcut = hidden_states
hidden_states = self.layernorm_before(hidden_states)
hidden_states = hidden_states.view(batch_size, height, width, channels)
# pad hidden_states to multiples of window size
hidden_states, pad_values = self.maybe_pad(hidden_states, height, width)
_, height_pad, width_pad, _ = hidden_states.shape
# cyclic shift
if self.shift_size > 0:
shifted_hidden_states = torch.roll(hidden_states, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))
else:
shifted_hidden_states = hidden_states
# partition windows
hidden_states_windows = window_partition(shifted_hidden_states, self.window_size)
hidden_states_windows = hidden_states_windows.view(-1, self.window_size * self.window_size, channels)
attn_mask = self.get_attn_mask(height_pad, width_pad, dtype=hidden_states.dtype)
if attn_mask is not None:
attn_mask = attn_mask.to(hidden_states_windows.device)
attention_outputs = self.attention(
hidden_states_windows, attn_mask, head_mask, output_attentions=output_attentions
)
attention_output = attention_outputs[0]
attention_windows = attention_output.view(-1, self.window_size, self.window_size, channels)
shifted_windows = window_reverse(attention_windows, self.window_size, height_pad, width_pad)
# reverse cyclic shift
if self.shift_size > 0:
attention_windows = torch.roll(shifted_windows, shifts=(self.shift_size, self.shift_size), dims=(1, 2))
else:
attention_windows = shifted_windows
was_padded = pad_values[3] > 0 or pad_values[5] > 0
if was_padded:
attention_windows = attention_windows[:, :height, :width, :].contiguous()
attention_windows = attention_windows.view(batch_size, height * width, channels)
hidden_states = shortcut + self.drop_path(attention_windows)
layer_output = self.layernorm_after(hidden_states)
layer_output = self.intermediate(layer_output)
layer_output = hidden_states + self.output(layer_output)
layer_outputs = (layer_output, attention_outputs[1]) if output_attentions else (layer_output,)
return layer_outputs
# Copied from transformers.models.swin.modeling_swin.SwinStage with Swin->ClapAudio
class ClapAudioStage(nn.Module):
def __init__(self, config, dim, input_resolution, depth, num_heads, drop_path, downsample):
super().__init__()
self.config = config
self.dim = dim
self.blocks = nn.ModuleList(
[
ClapAudioLayer(
config=config,
dim=dim,
input_resolution=input_resolution,
num_heads=num_heads,
shift_size=0 if (i % 2 == 0) else config.window_size // 2,
)
for i in range(depth)
]
)
# patch merging layer
if downsample is not None:
self.downsample = downsample(input_resolution, dim=dim, norm_layer=nn.LayerNorm)
else:
self.downsample = None
self.pointing = False
def forward(
self,
hidden_states: torch.Tensor,
input_dimensions: Tuple[int, int],
head_mask: Optional[torch.FloatTensor] = None,
output_attentions: Optional[bool] = False,
always_partition: Optional[bool] = False,
) -> Tuple[torch.Tensor]:
height, width = input_dimensions
for i, layer_module in enumerate(self.blocks):
layer_head_mask = head_mask[i] if head_mask is not None else None
layer_outputs = layer_module(
hidden_states, input_dimensions, layer_head_mask, output_attentions, always_partition
)
hidden_states = layer_outputs[0]
hidden_states_before_downsampling = hidden_states
if self.downsample is not None:
height_downsampled, width_downsampled = (height + 1) // 2, (width + 1) // 2
output_dimensions = (height, width, height_downsampled, width_downsampled)
hidden_states = self.downsample(hidden_states_before_downsampling, input_dimensions)
else:
output_dimensions = (height, width, height, width)
stage_outputs = (hidden_states, hidden_states_before_downsampling, output_dimensions)
if output_attentions:
stage_outputs += layer_outputs[1:]
return stage_outputs
# Copied from transformers.models.swin.modeling_swin.SwinPatchMerging with Swin->ClapAudio
class ClapAudioPatchMerging(nn.Module):
"""
Patch Merging Layer.
Args:
input_resolution (`Tuple[int]`):
Resolution of input feature.
dim (`int`):
Number of input channels.
norm_layer (`nn.Module`, *optional*, defaults to `nn.LayerNorm`):
Normalization layer class.
"""
def __init__(self, input_resolution: Tuple[int], dim: int, norm_layer: nn.Module = nn.LayerNorm) -> None:
super().__init__()
self.input_resolution = input_resolution
self.dim = dim
self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)
self.norm = norm_layer(4 * dim)
def maybe_pad(self, input_feature, height, width):
should_pad = (height % 2 == 1) or (width % 2 == 1)
if should_pad:
pad_values = (0, 0, 0, width % 2, 0, height % 2)
input_feature = nn.functional.pad(input_feature, pad_values)
return input_feature
def forward(self, input_feature: torch.Tensor, input_dimensions: Tuple[int, int]) -> torch.Tensor:
height, width = input_dimensions
# `dim` is height * width
batch_size, dim, num_channels = input_feature.shape
input_feature = input_feature.view(batch_size, height, width, num_channels)
# pad input to be disible by width and height, if needed
input_feature = self.maybe_pad(input_feature, height, width)
# [batch_size, height/2, width/2, num_channels]
input_feature_0 = input_feature[:, 0::2, 0::2, :]
# [batch_size, height/2, width/2, num_channels]
input_feature_1 = input_feature[:, 1::2, 0::2, :]
# [batch_size, height/2, width/2, num_channels]
input_feature_2 = input_feature[:, 0::2, 1::2, :]
# [batch_size, height/2, width/2, num_channels]
input_feature_3 = input_feature[:, 1::2, 1::2, :]
# batch_size height/2 width/2 4*num_channels
input_feature = torch.cat([input_feature_0, input_feature_1, input_feature_2, input_feature_3], -1)
input_feature = input_feature.view(batch_size, -1, 4 * num_channels) # batch_size height/2*width/2 4*C
input_feature = self.norm(input_feature)
input_feature = self.reduction(input_feature)
return input_feature
class ClapAudioEncoder(nn.Module):
def __init__(self, config):
super().__init__()
self.num_layers = len(config.depths)
self.config = config
self.patch_embed = ClapAudioPatchEmbed(config)
self.enable_fusion = config.enable_fusion
grid_size = self.patch_embed.grid_size
self.patch_stride = self.patch_embed.patch_stride
self.spec_size = config.spec_size
self.freq_ratio = self.spec_size // config.num_mel_bins
self.num_features = int(config.patch_embeds_hidden_size * 2 ** (self.num_layers - 1))
self.freq_ratio = config.spec_size // config.num_mel_bins
drop_path_rate = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
self.input_resolutions = [(grid_size[0] // (2**i), grid_size[1] // (2**i)) for i in range(self.num_layers)]
self.layers = nn.ModuleList(
[
ClapAudioStage(
config=config,
dim=int(config.patch_embeds_hidden_size * 2**i_layer),
input_resolution=self.input_resolutions[i_layer],
depth=config.depths[i_layer],
num_heads=config.num_attention_heads[i_layer],
drop_path=drop_path_rate[sum(config.depths[:i_layer]) : sum(config.depths[: i_layer + 1])],
downsample=ClapAudioPatchMerging if (i_layer < self.num_layers - 1) else None,
)
for i_layer in range(self.num_layers)
]
)
self.gradient_checkpointing = False
self.batch_norm = nn.BatchNorm2d(config.num_mel_bins)
self.norm = nn.LayerNorm(self.num_features)
self.depths = config.depths
self.avgpool = nn.AdaptiveAvgPool1d(1)
def reshape_mel2img(self, normalized_input_features):
"""
The input is 4 normalized log mel spectrograms. It is reshape to the common shape of images. Each channel
should represent 1 of the 4 crops of the spectrogram. For more details, refer to the [`ClapFeatureExtractor`].
"""
_, _, time_length, freq_length = normalized_input_features.shape
spec_width = int(self.spec_size * self.freq_ratio)
spec_heigth = self.spec_size // self.freq_ratio
if time_length > spec_width or freq_length > spec_heigth:
raise ValueError("the wav size should be less than or equal to the swin input size")
# to avoid bicubic zero error
if time_length < spec_width:
normalized_input_features = nn.functional.interpolate(
normalized_input_features, (spec_width, freq_length), mode="bicubic", align_corners=True
)
if freq_length < spec_heigth:
normalized_input_features = nn.functional.interpolate(
normalized_input_features, (time_length, spec_heigth), mode="bicubic", align_corners=True
)
batch, channels, time, freq = normalized_input_features.shape
# batch_size, channels, spec_width, spec_heigth --> batch_size, channels, spec_heigth * freq_ratio, spec_width // freq_ratio
normalized_input_features = normalized_input_features.reshape(
batch, channels * self.freq_ratio, time // self.freq_ratio, freq
)
normalized_input_features = normalized_input_features.permute(0, 1, 3, 2).contiguous()
normalized_input_features = normalized_input_features.reshape(
batch, channels, freq * self.freq_ratio, time // self.freq_ratio
)
return normalized_input_features
def forward(
self,
input_features,
head_mask: Optional[torch.FloatTensor] = None,
is_longer: Optional[torch.FloatTensor] = None,
output_attentions: Optional[bool] = False,
output_hidden_states: Optional[bool] = False,
output_hidden_states_before_downsampling: Optional[bool] = False,
always_partition: Optional[bool] = False,
return_dict: Optional[bool] = True,
) -> Union[Tuple, ClapAudioModelOutput]:
input_features = input_features.transpose(1, 3)
normalized_input_features = self.batch_norm(input_features)
normalized_input_features = normalized_input_features.transpose(1, 3)
is_longer_list_idx = None
if self.enable_fusion:
is_longer_list = is_longer.to(input_features.device)
is_longer_list_idx = torch.where(is_longer_list == 1)[0]
hidden_states = self.reshape_mel2img(normalized_input_features)
frames_num = hidden_states.shape[2]
hidden_states = self.patch_embed(hidden_states, is_longer_list_idx)
all_hidden_states = () if output_hidden_states else None
all_reshaped_hidden_states = () if output_hidden_states else None
all_self_attentions = () if output_attentions else None
input_dimensions = self.input_resolutions[0]
if output_hidden_states:
batch_size, _, hidden_size = hidden_states.shape
# rearrange batch_size (height width) channels -> batch_size channel height width
reshaped_hidden_state = hidden_states.view(batch_size, *input_dimensions, hidden_size)
reshaped_hidden_state = reshaped_hidden_state.permute(0, 3, 1, 2)
all_hidden_states += (hidden_states,)
all_reshaped_hidden_states += (reshaped_hidden_state,)
for i, layer_module in enumerate(self.layers):
layer_head_mask = head_mask[i] if head_mask is not None else None
input_dimensions = self.input_resolutions[i]
if self.gradient_checkpointing and self.training:
def create_custom_forward(module):
def custom_forward(*inputs):
return module(*inputs, output_attentions)
return custom_forward
layer_outputs = torch.utils.checkpoint.checkpoint(
create_custom_forward(layer_module), hidden_states, input_dimensions, layer_head_mask
)
else:
layer_outputs = layer_module(
hidden_states, input_dimensions, layer_head_mask, output_attentions, always_partition
)
hidden_states = layer_outputs[0]
hidden_states_before_downsampling = layer_outputs[1]
output_dimensions = layer_outputs[2]
input_dimensions = (output_dimensions[-2], output_dimensions[-1])
if output_hidden_states and output_hidden_states_before_downsampling:
batch_size, _, hidden_size = hidden_states_before_downsampling.shape
# rearrange batch_size (height width) channels -> batch_size channel height width
# here we use the original (not downsampled) height and width
reshaped_hidden_state = hidden_states_before_downsampling.view(
batch_size, *(output_dimensions[0], output_dimensions[1]), hidden_size
)
reshaped_hidden_state = reshaped_hidden_state.permute(0, 3, 1, 2)
all_hidden_states += (hidden_states_before_downsampling,)
all_reshaped_hidden_states += (reshaped_hidden_state,)
elif output_hidden_states and not output_hidden_states_before_downsampling:
batch_size, _, hidden_size = hidden_states.shape
# rearrange batch_size (height width) channels -> batch_size channel height width
reshaped_hidden_state = hidden_states.view(batch_size, *input_dimensions, hidden_size)
reshaped_hidden_state = reshaped_hidden_state.permute(0, 3, 1, 2)
all_hidden_states += (hidden_states,)
all_reshaped_hidden_states += (reshaped_hidden_state,)
if output_attentions:
all_self_attentions += layer_outputs[3:]
last_hidden_state = self.norm(hidden_states)
batch_size, _, n_channels = last_hidden_state.shape
freq_shape = frames_num // (2 ** (len(self.depths) - 1)) // self.patch_stride[0]
temporal_shape = frames_num // (2 ** (len(self.depths) - 1)) // self.patch_stride[1]
last_hidden_state = (
last_hidden_state.permute(0, 2, 1).contiguous().reshape(batch_size, n_channels, freq_shape, temporal_shape)
)
batch_size, n_channels, n_frequencies, n_temp = last_hidden_state.shape
# group 2D CNN
c_freq_bin = n_frequencies // self.freq_ratio
last_hidden_state = last_hidden_state.reshape(
batch_size, n_channels, n_frequencies // c_freq_bin, c_freq_bin, n_temp
)
last_hidden_state = (
last_hidden_state.permute(0, 1, 3, 2, 4).contiguous().reshape(batch_size, n_channels, c_freq_bin, -1)
)
latent_output = self.avgpool(torch.flatten(last_hidden_state, 2))
latent_output = torch.flatten(latent_output, 1)
if not return_dict:
return tuple(
v
for v in [
last_hidden_state,
latent_output,
all_reshaped_hidden_states,
all_self_attentions,
]
if v is not None
)
return BaseModelOutputWithPooling(
last_hidden_state=last_hidden_state,
pooler_output=latent_output,
hidden_states=all_reshaped_hidden_states,
attentions=all_self_attentions,
)
CLAP_START_DOCSTRING = r"""
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
and behavior.
Parameters:
config ([`ClapConfig`]): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
CLAP_TEXT_INPUTS_DOCSTRING = r"""
Args:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
[What are attention masks?](../glossary#attention-mask)
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
config.max_position_embeddings - 1]`.
[What are position IDs?](../glossary#position-ids)
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
"""
CLAP_AUDIO_INPUTS_DOCSTRING = r"""
Args:
input_features (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Input audio features. This should be returnes by the [`ClapFeatureExtractor`] class that you can also
retrieve from [`AutoFeatureExtractor`]. See [`ClapFeatureExtractor.__call__`] for details.
is_longer (`torch.FloatTensor`, of shape `(batch_size, 1)`, *optional*):
Whether the audio clip is longer than `max_length`. If `True`, a feature fusion will be enabled to enhance
the features.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
"""
CLAP_INPUTS_DOCSTRING = r"""
Args:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
[What are attention masks?](../glossary#attention-mask)
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
config.max_position_embeddings - 1]`.
[What are position IDs?](../glossary#position-ids)
input_features (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Input audio features. This should be returnes by the [`ClapFeatureExtractor`] class that you can also
retrieve from [`AutoFeatureExtractor`]. See [`ClapFeatureExtractor.__call__`] for details.
return_loss (`bool`, *optional*):
Whether or not to return the contrastive loss.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
"""
class ClapProjectionLayer(nn.Module):
def __init__(self, config: Union[ClapAudioConfig, ClapTextConfig]):
super().__init__()
self.config = config
hidden_size = config.hidden_size
projection_dim = config.projection_dim
self.linear1 = nn.Linear(hidden_size, projection_dim)
self.activation = ACT2FN[config.projection_hidden_act]
self.linear2 = nn.Linear(projection_dim, projection_dim)
def forward(self, hidden_states):
hidden_states = self.linear1(hidden_states)
hidden_states = self.activation(hidden_states)
hidden_states = self.linear2(hidden_states)
return hidden_states
# Copied from transformers.models.roberta.modeling_roberta.RobertaEmbeddings with Roberta->ClapText, persistent=False->persistent=True
class ClapTextEmbeddings(nn.Module):
"""
Same as BertEmbeddings with a tiny tweak for positional embeddings indexing.
"""
# Copied from transformers.models.bert.modeling_bert.BertEmbeddings.__init__
def __init__(self, config):
super().__init__()
self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
# self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
# any TensorFlow checkpoint file
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
# position_ids (1, len position emb) is contiguous in memory and exported when serialized
self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
self.register_buffer(
"token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=True
)
# End copy
self.padding_idx = config.pad_token_id
self.position_embeddings = nn.Embedding(
config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx
)
def forward(
self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None, past_key_values_length=0
):
if position_ids is None:
if input_ids is not None:
# Create the position ids from the input token ids. Any padded tokens remain padded.
position_ids = create_position_ids_from_input_ids(input_ids, self.padding_idx, past_key_values_length)
else:
position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds)
if input_ids is not None:
input_shape = input_ids.size()
else:
input_shape = inputs_embeds.size()[:-1]
seq_length = input_shape[1]
# Setting the token_type_ids to the registered buffer in constructor where it is all zeros, which usually occurs
# when its auto-generated, registered buffer helps users when tracing the model without passing token_type_ids, solves
# issue #5664
if token_type_ids is None:
if hasattr(self, "token_type_ids"):
buffered_token_type_ids = self.token_type_ids[:, :seq_length]
buffered_token_type_ids_expanded = buffered_token_type_ids.expand(input_shape[0], seq_length)
token_type_ids = buffered_token_type_ids_expanded
else:
token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)
if inputs_embeds is None:
inputs_embeds = self.word_embeddings(input_ids)
token_type_embeddings = self.token_type_embeddings(token_type_ids)
embeddings = inputs_embeds + token_type_embeddings
if self.position_embedding_type == "absolute":
position_embeddings = self.position_embeddings(position_ids)
embeddings += position_embeddings
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)
return embeddings
def create_position_ids_from_inputs_embeds(self, inputs_embeds):
"""
We are provided embeddings directly. We cannot infer which are padded so just generate sequential position ids.
Args:
inputs_embeds: torch.Tensor
Returns: torch.Tensor
"""
input_shape = inputs_embeds.size()[:-1]
sequence_length = input_shape[1]
position_ids = torch.arange(
self.padding_idx + 1, sequence_length + self.padding_idx + 1, dtype=torch.long, device=inputs_embeds.device
)
return position_ids.unsqueeze(0).expand(input_shape)
# Copied from transformers.models.bert.modeling_bert.BertSelfAttention with Bert->ClapText
class ClapTextSelfAttention(nn.Module):
def __init__(self, config, position_embedding_type=None):
super().__init__()
if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
raise ValueError(
f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
f"heads ({config.num_attention_heads})"
)
self.num_attention_heads = config.num_attention_heads
self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
self.all_head_size = self.num_attention_heads * self.attention_head_size
self.query = nn.Linear(config.hidden_size, self.all_head_size)
self.key = nn.Linear(config.hidden_size, self.all_head_size)
self.value = nn.Linear(config.hidden_size, self.all_head_size)
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
self.position_embedding_type = position_embedding_type or getattr(
config, "position_embedding_type", "absolute"
)
if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
self.max_position_embeddings = config.max_position_embeddings
self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)
self.is_decoder = config.is_decoder
def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor:
new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
x = x.view(new_x_shape)
return x.permute(0, 2, 1, 3)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.FloatTensor] = None,
head_mask: Optional[torch.FloatTensor] = None,
encoder_hidden_states: Optional[torch.FloatTensor] = None,
encoder_attention_mask: Optional[torch.FloatTensor] = None,
past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
output_attentions: Optional[bool] = False,
) -> Tuple[torch.Tensor]:
mixed_query_layer = self.query(hidden_states)
# If this is instantiated as a cross-attention module, the keys
# and values come from an encoder; the attention mask needs to be
# such that the encoder's padding tokens are not attended to.
is_cross_attention = encoder_hidden_states is not None
if is_cross_attention and past_key_value is not None:
# reuse k,v, cross_attentions
key_layer = past_key_value[0]
value_layer = past_key_value[1]
attention_mask = encoder_attention_mask
elif is_cross_attention:
key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))
value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))
attention_mask = encoder_attention_mask
elif past_key_value is not None:
key_layer = self.transpose_for_scores(self.key(hidden_states))
value_layer = self.transpose_for_scores(self.value(hidden_states))
key_layer = torch.cat([past_key_value[0], key_layer], dim=2)
value_layer = torch.cat([past_key_value[1], value_layer], dim=2)
else:
key_layer = self.transpose_for_scores(self.key(hidden_states))
value_layer = self.transpose_for_scores(self.value(hidden_states))
query_layer = self.transpose_for_scores(mixed_query_layer)
use_cache = past_key_value is not None
if self.is_decoder:
# if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states.
# Further calls to cross_attention layer can then reuse all cross-attention
# key/value_states (first "if" case)
# if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of
# all previous decoder key/value_states. Further calls to uni-directional self-attention
# can concat previous decoder key/value_states to current projected key/value_states (third "elif" case)
# if encoder bi-directional self-attention `past_key_value` is always `None`
past_key_value = (key_layer, value_layer)
# Take the dot product between "query" and "key" to get the raw attention scores.
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
query_length, key_length = query_layer.shape[2], key_layer.shape[2]
if use_cache:
position_ids_l = torch.tensor(key_length - 1, dtype=torch.long, device=hidden_states.device).view(
-1, 1
)
else:
position_ids_l = torch.arange(query_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
position_ids_r = torch.arange(key_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
distance = position_ids_l - position_ids_r
positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
positional_embedding = positional_embedding.to(dtype=query_layer.dtype) # fp16 compatibility
if self.position_embedding_type == "relative_key":
relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
attention_scores = attention_scores + relative_position_scores
elif self.position_embedding_type == "relative_key_query":
relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
if attention_mask is not None:
# Apply the attention mask is (precomputed for all layers in ClapTextModel forward() function)
attention_scores = attention_scores + attention_mask
# Normalize the attention scores to probabilities.
attention_probs = nn.functional.softmax(attention_scores, dim=-1)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = self.dropout(attention_probs)
# Mask heads if we want to
if head_mask is not None:
attention_probs = attention_probs * head_mask
context_layer = torch.matmul(attention_probs, value_layer)
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
context_layer = context_layer.view(new_context_layer_shape)
outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
if self.is_decoder:
outputs = outputs + (past_key_value,)
return outputs
# Copied from transformers.models.bert.modeling_bert.BertSelfOutput
class ClapTextSelfOutput(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.LayerNorm(hidden_states + input_tensor)
return hidden_states
# Copied from transformers.models.bert.modeling_bert.BertAttention with Bert->ClapText
class ClapTextAttention(nn.Module):
def __init__(self, config, position_embedding_type=None):
super().__init__()
self.self = ClapTextSelfAttention(config, position_embedding_type=position_embedding_type)
self.output = ClapTextSelfOutput(config)
self.pruned_heads = set()
def prune_heads(self, heads):
if len(heads) == 0:
return
heads, index = find_pruneable_heads_and_indices(
heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads
)
# Prune linear layers
self.self.query = prune_linear_layer(self.self.query, index)
self.self.key = prune_linear_layer(self.self.key, index)
self.self.value = prune_linear_layer(self.self.value, index)
self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
# Update hyper params and store pruned heads
self.self.num_attention_heads = self.self.num_attention_heads - len(heads)
self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads
self.pruned_heads = self.pruned_heads.union(heads)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.FloatTensor] = None,
head_mask: Optional[torch.FloatTensor] = None,
encoder_hidden_states: Optional[torch.FloatTensor] = None,
encoder_attention_mask: Optional[torch.FloatTensor] = None,
past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
output_attentions: Optional[bool] = False,
) -> Tuple[torch.Tensor]:
self_outputs = self.self(
hidden_states,
attention_mask,
head_mask,
encoder_hidden_states,
encoder_attention_mask,
past_key_value,
output_attentions,
)
attention_output = self.output(self_outputs[0], hidden_states)
outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them
return outputs
# Copied from transformers.models.bert.modeling_bert.BertIntermediate
class ClapTextIntermediate(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
if isinstance(config.hidden_act, str):
self.intermediate_act_fn = ACT2FN[config.hidden_act]
else:
self.intermediate_act_fn = config.hidden_act
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
hidden_states = self.dense(hidden_states)
hidden_states = self.intermediate_act_fn(hidden_states)
return hidden_states
# Copied from transformers.models.bert.modeling_bert.BertOutput
class ClapTextOutput(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.LayerNorm(hidden_states + input_tensor)
return hidden_states
# Copied from transformers.models.bert.modeling_bert.BertLayer with Bert->ClapText
class ClapTextLayer(nn.Module):
def __init__(self, config):
super().__init__()
self.chunk_size_feed_forward = config.chunk_size_feed_forward
self.seq_len_dim = 1
self.attention = ClapTextAttention(config)
self.is_decoder = config.is_decoder
self.add_cross_attention = config.add_cross_attention
if self.add_cross_attention:
if not self.is_decoder:
raise ValueError(f"{self} should be used as a decoder model if cross attention is added")
self.crossattention = ClapTextAttention(config, position_embedding_type="absolute")
self.intermediate = ClapTextIntermediate(config)
self.output = ClapTextOutput(config)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.FloatTensor] = None,
head_mask: Optional[torch.FloatTensor] = None,
encoder_hidden_states: Optional[torch.FloatTensor] = None,
encoder_attention_mask: Optional[torch.FloatTensor] = None,
past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
output_attentions: Optional[bool] = False,
) -> Tuple[torch.Tensor]:
# decoder uni-directional self-attention cached key/values tuple is at positions 1,2
self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None
self_attention_outputs = self.attention(
hidden_states,
attention_mask,
head_mask,
output_attentions=output_attentions,
past_key_value=self_attn_past_key_value,
)
attention_output = self_attention_outputs[0]
# if decoder, the last output is tuple of self-attn cache
if self.is_decoder:
outputs = self_attention_outputs[1:-1]
present_key_value = self_attention_outputs[-1]
else:
outputs = self_attention_outputs[1:] # add self attentions if we output attention weights
cross_attn_present_key_value = None
if self.is_decoder and encoder_hidden_states is not None:
if not hasattr(self, "crossattention"):
raise ValueError(
f"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers"
" by setting `config.add_cross_attention=True`"
)
# cross_attn cached key/values tuple is at positions 3,4 of past_key_value tuple
cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None
cross_attention_outputs = self.crossattention(
attention_output,
attention_mask,
head_mask,
encoder_hidden_states,
encoder_attention_mask,
cross_attn_past_key_value,
output_attentions,
)
attention_output = cross_attention_outputs[0]
outputs = outputs + cross_attention_outputs[1:-1] # add cross attentions if we output attention weights
# add cross-attn cache to positions 3,4 of present_key_value tuple
cross_attn_present_key_value = cross_attention_outputs[-1]
present_key_value = present_key_value + cross_attn_present_key_value
layer_output = apply_chunking_to_forward(
self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
)
outputs = (layer_output,) + outputs
# if decoder, return the attn key/values as the last output
if self.is_decoder:
outputs = outputs + (present_key_value,)
return outputs
def feed_forward_chunk(self, attention_output):
intermediate_output = self.intermediate(attention_output)
layer_output = self.output(intermediate_output, attention_output)
return layer_output
# Copied from transformers.models.bert.modeling_bert.BertEncoder with Bert->ClapText
class ClapTextEncoder(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.layer = nn.ModuleList([ClapTextLayer(config) for _ in range(config.num_hidden_layers)])
self.gradient_checkpointing = False
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.FloatTensor] = None,
head_mask: Optional[torch.FloatTensor] = None,
encoder_hidden_states: Optional[torch.FloatTensor] = None,
encoder_attention_mask: Optional[torch.FloatTensor] = None,
past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = False,
output_hidden_states: Optional[bool] = False,
return_dict: Optional[bool] = True,
) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPastAndCrossAttentions]:
all_hidden_states = () if output_hidden_states else None
all_self_attentions = () if output_attentions else None
all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None
next_decoder_cache = () if use_cache else None
for i, layer_module in enumerate(self.layer):
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
layer_head_mask = head_mask[i] if head_mask is not None else None
past_key_value = past_key_values[i] if past_key_values is not None else None
if self.gradient_checkpointing and self.training:
if use_cache:
logger.warning(
"`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
)
use_cache = False
def create_custom_forward(module):
def custom_forward(*inputs):
return module(*inputs, past_key_value, output_attentions)
return custom_forward
layer_outputs = torch.utils.checkpoint.checkpoint(
create_custom_forward(layer_module),
hidden_states,
attention_mask,
layer_head_mask,
encoder_hidden_states,
encoder_attention_mask,
)
else:
layer_outputs = layer_module(
hidden_states,
attention_mask,
layer_head_mask,
encoder_hidden_states,
encoder_attention_mask,
past_key_value,
output_attentions,
)
hidden_states = layer_outputs[0]
if use_cache:
next_decoder_cache += (layer_outputs[-1],)
if output_attentions:
all_self_attentions = all_self_attentions + (layer_outputs[1],)
if self.config.add_cross_attention:
all_cross_attentions = all_cross_attentions + (layer_outputs[2],)
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
if not return_dict:
return tuple(
v
for v in [
hidden_states,
next_decoder_cache,
all_hidden_states,
all_self_attentions,
all_cross_attentions,
]
if v is not None
)
return BaseModelOutputWithPastAndCrossAttentions(
last_hidden_state=hidden_states,
past_key_values=next_decoder_cache,
hidden_states=all_hidden_states,
attentions=all_self_attentions,
cross_attentions=all_cross_attentions,
)
# Copied from transformers.models.bert.modeling_bert.BertPooler
class ClapTextPooler(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.activation = nn.Tanh()
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
# We "pool" the model by simply taking the hidden state corresponding
# to the first token.
first_token_tensor = hidden_states[:, 0]
pooled_output = self.dense(first_token_tensor)
pooled_output = self.activation(pooled_output)
return pooled_output
class ClapPreTrainedModel(PreTrainedModel):
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = ClapTextConfig
base_model_prefix = "clap"
supports_gradient_checkpointing = False
_keys_to_ignore_on_load_missing = [r"position_ids", r"logit_scale_a", r"logit_scale_t"]
def _init_weights(self, module):
"""Initialize the weights"""
factor = self.config.initializer_factor
if isinstance(module, ClapTextEmbeddings):
module.position_embeddings.weight.data.normal_(mean=0.0, std=factor * 0.02)
module.token_type_embeddings.weight.data.normal_(mean=0.0, std=factor * 0.02)
elif isinstance(module, ClapModel):
nn.init.normal_(module.logit_scale_a, std=factor * 0.02)
nn.init.normal_(module.logit_scale_t, std=factor * 0.02)
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=factor * 0.02)
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
elif isinstance(module, (nn.Conv2d, nn.Linear)):
in_proj_std = (self.config.hidden_size**-0.5) * ((2 * self.config.num_hidden_layers) ** -0.5) * factor
nn.init.normal_(module.weight, std=in_proj_std)
if module.bias is not None:
module.bias.data.zero_()
def _set_gradient_checkpointing(self, module, value=False):
if isinstance(module, ClapTextEncoder):
module.gradient_checkpointing = value
class ClapAudioModel(ClapPreTrainedModel):
config_class = ClapAudioConfig
main_input_name = "input_features"
def __init__(self, config: ClapAudioConfig):
super().__init__(config)
self.audio_encoder = ClapAudioEncoder(config)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self) -> nn.Module:
return self.audio_encoder.patch_embed.proj
@add_start_docstrings_to_model_forward(CLAP_AUDIO_INPUTS_DOCSTRING)
@replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=ClapAudioConfig)
def forward(
self,
input_features: Optional[torch.FloatTensor] = None,
is_longer: Optional[torch.BoolTensor] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPooling]:
r"""
Returns:
Examples:
```python
>>> from datasets import load_dataset
>>> from transformers import AutoProcessor, ClapAudioModel
>>> dataset = load_dataset("ashraq/esc50")
>>> audio_sample = dataset["train"]["audio"][0]["array"]
>>> model = ClapAudioModel.from_pretrained("laion/clap-htsat-fused")
>>> processor = AutoProcessor.from_pretrained("laion/clap-htsat-fused")
>>> inputs = processor(audios=audio_sample, return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.audio_emmbeds
```"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return self.audio_encoder(
input_features=input_features,
is_longer=is_longer,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
class ClapTextModel(ClapPreTrainedModel):
"""
The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
cross-attention is added between the self-attention layers, following the architecture described in *Attention is
all you need*_ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser and Illia Polosukhin.
To behave as an decoder the model needs to be initialized with the `is_decoder` argument of the configuration set
to `True`. To be used in a Seq2Seq model, the model needs to initialized with both `is_decoder` argument and
`add_cross_attention` set to `True`; an `encoder_hidden_states` is then expected as an input to the forward pass.
.. _*Attention is all you need*: https://arxiv.org/abs/1706.03762
"""
config_class = ClapTextConfig
_keys_to_ignore_on_load_missing = [r"position_ids"]
# Copied from transformers.models.bert.modeling_bert.BertModel.__init__ with Bert->ClapText
def __init__(self, config, add_pooling_layer=True):
super().__init__(config)
self.config = config
self.embeddings = ClapTextEmbeddings(config)
self.encoder = ClapTextEncoder(config)
self.pooler = ClapTextPooler(config) if add_pooling_layer else None
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.embeddings.word_embeddings
def set_input_embeddings(self, value):
self.embeddings.word_embeddings = value
# Copied from transformers.models.bert.modeling_bert.BertModel.forward
def forward(
self,
input_ids: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
token_type_ids: Optional[torch.Tensor] = None,
position_ids: Optional[torch.Tensor] = None,
head_mask: Optional[torch.Tensor] = None,
inputs_embeds: Optional[torch.Tensor] = None,
encoder_hidden_states: Optional[torch.Tensor] = None,
encoder_attention_mask: Optional[torch.Tensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPoolingAndCrossAttentions]:
r"""
encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
the model is configured as a decoder.
encoder_attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
`decoder_input_ids` of shape `(batch_size, sequence_length)`.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
`past_key_values`).
"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
if self.config.is_decoder:
use_cache = use_cache if use_cache is not None else self.config.use_cache
else:
use_cache = False
if input_ids is not None and inputs_embeds is not None:
raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
elif input_ids is not None:
input_shape = input_ids.size()
elif inputs_embeds is not None:
input_shape = inputs_embeds.size()[:-1]
else:
raise ValueError("You have to specify either input_ids or inputs_embeds")
batch_size, seq_length = input_shape
device = input_ids.device if input_ids is not None else inputs_embeds.device
# past_key_values_length
past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0
if attention_mask is None:
attention_mask = torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)
if token_type_ids is None:
if hasattr(self.embeddings, "token_type_ids"):
buffered_token_type_ids = self.embeddings.token_type_ids[:, :seq_length]
buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
token_type_ids = buffered_token_type_ids_expanded
else:
token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
# We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
# ourselves in which case we just need to make it broadcastable to all heads.
extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape)
# If a 2D or 3D attention mask is provided for the cross-attention
# we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
if self.config.is_decoder and encoder_hidden_states is not None:
encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
if encoder_attention_mask is None:
encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)
encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
else:
encoder_extended_attention_mask = None
# Prepare head mask if needed
# 1.0 in head_mask indicate we keep the head
# attention_probs has shape bsz x n_heads x N x N
# input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
# and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
embedding_output = self.embeddings(
input_ids=input_ids,
position_ids=position_ids,
token_type_ids=token_type_ids,
inputs_embeds=inputs_embeds,
past_key_values_length=past_key_values_length,
)
encoder_outputs = self.encoder(
embedding_output,
attention_mask=extended_attention_mask,
head_mask=head_mask,
encoder_hidden_states=encoder_hidden_states,
encoder_attention_mask=encoder_extended_attention_mask,
past_key_values=past_key_values,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = encoder_outputs[0]
pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
if not return_dict:
return (sequence_output, pooled_output) + encoder_outputs[1:]
return BaseModelOutputWithPoolingAndCrossAttentions(
last_hidden_state=sequence_output,
pooler_output=pooled_output,
past_key_values=encoder_outputs.past_key_values,
hidden_states=encoder_outputs.hidden_states,
attentions=encoder_outputs.attentions,
cross_attentions=encoder_outputs.cross_attentions,
)
@add_start_docstrings(CLAP_START_DOCSTRING)
class ClapModel(ClapPreTrainedModel):
config_class = ClapConfig
_keys_to_ignore_on_load_missing = [r"position_ids"]
def __init__(self, config: ClapConfig):
super().__init__(config)
if not isinstance(config.text_config, ClapTextConfig):
raise ValueError(
"config.text_config is expected to be of type ClapTextConfig but is of type"
f" {type(config.text_config)}."
)
if not isinstance(config.audio_config, ClapAudioConfig):
raise ValueError(
"config.audio_config is expected to be of type ClapAudioConfig but is of type"
f" {type(config.audio_config)}."
)
text_config = config.text_config
audio_config = config.audio_config
self.logit_scale_a = nn.Parameter(torch.ones([]) * np.log(config.logit_scale_init_value))
self.logit_scale_t = nn.Parameter(torch.ones([]) * np.log(config.logit_scale_init_value))
self.projection_dim = config.projection_dim
self.text_model = ClapTextModel(text_config)
self.text_projection = ClapProjectionLayer(text_config)
self.audio_model = ClapAudioModel(audio_config)
self.audio_projection = ClapProjectionLayer(audio_config)
# Initialize weights and apply final processing
self.post_init()
@add_start_docstrings_to_model_forward(CLAP_TEXT_INPUTS_DOCSTRING)
def get_text_features(
self,
input_ids: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.Tensor] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> torch.FloatTensor:
r"""
Returns:
text_features (`torch.FloatTensor` of shape `(batch_size, output_dim`): The text embeddings obtained by
applying the projection layer to the pooled output of [`ClapTextModel`].
Examples:
```python
>>> from transformers import AutoTokenizer, ClapModel
>>> model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
>>> tokenizer = AutoTokenizer.from_pretrained("laion/clap-htsat-unfused")
>>> inputs = tokenizer(["the sound of a cat", "the sound of a dog"], padding=True, return_tensors="pt")
>>> text_features = model.get_text_features(**inputs)
```"""
# Use CLAP model's config for some fields (if specified) instead of those of audio & text components.
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
text_outputs = self.text_model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
pooled_output = text_outputs[1] if return_dict is not None else text_outputs.pooler_output
text_features = self.text_projection(pooled_output)
text_features = F.normalize(text_features, dim=-1)
return text_features
@add_start_docstrings_to_model_forward(CLAP_AUDIO_INPUTS_DOCSTRING)
def get_audio_features(
self,
input_features: Optional[torch.Tensor] = None,
is_longer: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> torch.FloatTensor:
r"""
Returns:
audio_features (`torch.FloatTensor` of shape `(batch_size, output_dim`): The audio embeddings obtained by
applying the projection layer to the pooled output of [`ClapAudioModel`].
Examples:
```python
>>> from transformers import AutoFeatureExtractor, ClapModel
>>> import torch
>>> model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("laion/clap-htsat-unfused")
>>> random_audio = torch.rand((16_000))
>>> inputs = feature_extractor(random_audio, return_tensors="pt")
>>> audio_features = model.get_audio_features(**inputs)
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
audio_outputs = self.audio_model(
input_features=input_features,
is_longer=is_longer,
return_dict=return_dict,
)
pooled_output = audio_outputs[1] if not return_dict else audio_outputs.pooler_output
audio_features = self.audio_projection(pooled_output)
audio_features = F.normalize(audio_features, dim=-1)
return audio_features
@add_start_docstrings_to_model_forward(CLAP_INPUTS_DOCSTRING)
@replace_return_docstrings(output_type=ClapOutput, config_class=ClapConfig)
def forward(
self,
input_ids: Optional[torch.LongTensor] = None,
input_features: Optional[torch.FloatTensor] = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
return_loss: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, ClapOutput]:
r"""
Returns:
Examples:
```python
>>> from datasets import load_dataset
>>> from transformers import AutoProcessor, ClapModel
>>> dataset = load_dataset("ashraq/esc50")
>>> audio_sample = dataset["train"]["audio"][0]["array"]
>>> model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
>>> processor = AutoProcessor.from_pretrained("laion/clap-htsat-unfused")
>>> input_text = ["Sound of a dog", "Sound of vaccum cleaner"]
>>> inputs = processor(text=input_text, audios=audio_sample, return_tensors="pt", padding=True)
>>> outputs = model(**inputs)
>>> logits_per_audio = outputs.logits_per_audio # this is the audio-text similarity score
>>> probs = logits_per_audio.softmax(dim=-1) # we can take the softmax to get the label probabilities
```"""
# Use CLAP model's config for some fields (if specified) instead of those of audio & text components.
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
audio_outputs = self.audio_model(
input_features=input_features,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
text_outputs = self.text_model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
audio_embeds = audio_outputs[1] if not return_dict else audio_outputs.pooler_output
audio_embeds = self.audio_projection(audio_embeds)
text_embeds = text_outputs[1] if not return_dict else text_outputs.pooler_output
text_embeds = self.text_projection(text_embeds)
# normalized features
audio_embeds = audio_embeds / audio_embeds.norm(p=2, dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(p=2, dim=-1, keepdim=True)
# cosine similarity as logits
logit_scale_text = self.logit_scale_t.exp()
logit_scale_audio = self.logit_scale_a.exp()
logits_per_text = torch.matmul(text_embeds, audio_embeds.t()) * logit_scale_text
logits_per_audio = torch.matmul(audio_embeds, text_embeds.t()) * logit_scale_audio
loss = None
if return_loss:
caption_loss = contrastive_loss(logits_per_text)
audio_loss = contrastive_loss(logits_per_text.t())
loss = (caption_loss + audio_loss) / 2.0
if not return_dict:
output = (logits_per_audio, logits_per_text, text_embeds, audio_embeds, text_outputs, audio_outputs)
return ((loss,) + output) if loss is not None else output
return ClapOutput(
loss=loss,
logits_per_audio=logits_per_audio,
logits_per_text=logits_per_text,
text_embeds=text_embeds,
audio_embeds=audio_embeds,
text_model_output=text_outputs,
audio_model_output=audio_outputs,
)
@add_start_docstrings(
"""
CLAP Text Model with a projection layer on top (a linear layer on top of the pooled output).
""",
CLAP_START_DOCSTRING,
)
class ClapTextModelWithProjection(ClapPreTrainedModel):
config_class = ClapTextConfig
def __init__(self, config: ClapTextConfig):
super().__init__(config)
self.text_model = ClapTextModel(config)
self.text_projection = ClapProjectionLayer(config)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self) -> nn.Module:
return self.text_model.embeddings.word_embeddings
def set_input_embeddings(self, value):
self.text_model.embeddings.word_embeddings = value
@add_start_docstrings_to_model_forward(CLAP_TEXT_INPUTS_DOCSTRING)
@replace_return_docstrings(output_type=ClapTextModelOutput, config_class=ClapTextConfig)
def forward(
self,
input_ids: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.Tensor] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, ClapTextModelOutput]:
r"""
Returns:
Examples:
```python
>>> from transformers import AutoTokenizer, ClapTextModelWithProjection
>>> model = ClapTextModelWithProjection.from_pretrained("laion/clap-htsat-unfused")
>>> tokenizer = AutoTokenizer.from_pretrained("laion/clap-htsat-unfused")
>>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
>>> outputs = model(**inputs)
>>> text_embeds = outputs.text_embeds
```"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
text_outputs = self.text_model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
pooled_output = text_outputs[1] if not return_dict else text_outputs.pooler_output
text_embeds = self.text_projection(pooled_output)
if not return_dict:
outputs = (text_embeds, text_outputs[0]) + text_outputs[2:]
return tuple(output for output in outputs if output is not None)
return ClapTextModelOutput(
text_embeds=text_embeds,
last_hidden_state=text_outputs.last_hidden_state,
hidden_states=text_outputs.hidden_states,
attentions=text_outputs.attentions,
)
@add_start_docstrings(
"""
CLAP Audio Model with a projection layer on top (a linear layer on top of the pooled output).
""",
CLAP_START_DOCSTRING,
)
class ClapAudioModelWithProjection(ClapPreTrainedModel):
config_class = ClapAudioConfig
main_input_name = "input_features"
def __init__(self, config: ClapAudioConfig):
super().__init__(config)
self.audio_model = ClapAudioModel(config)
self.audio_projection = ClapProjectionLayer(config)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self) -> nn.Module:
return self.audio_model.audio_encoder.patch_embed.proj
@add_start_docstrings_to_model_forward(CLAP_AUDIO_INPUTS_DOCSTRING)
@replace_return_docstrings(output_type=ClapAudioModelOutput, config_class=ClapAudioConfig)
def forward(
self,
input_features: Optional[torch.FloatTensor] = None,
is_longer: Optional[torch.BoolTensor] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, ClapAudioModelOutput]:
r"""
Returns:
Examples:
```python
>>> from datasets import load_dataset
>>> from transformers import ClapAudioModelWithProjection, ClapProcessor
>>> model = ClapAudioModelWithProjection.from_pretrained("laion/clap-htsat-fused")
>>> processor = ClapProcessor.from_pretrained("laion/clap-htsat-fused")
>>> dataset = load_dataset("ashraq/esc50")
>>> audio_sample = dataset["train"]["audio"][0]["array"]
>>> inputs = processor(audios=audio_sample, return_tensors="pt")
>>> outputs = model(**inputs)
>>> audio_embeds = outputs.audio_embeds
```"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
audio_outputs = self.audio_model(
input_features=input_features,
is_longer=is_longer,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
pooled_output = audio_outputs[1] if not return_dict else audio_outputs.pooler_output
audio_embeds = self.audio_projection(pooled_output)
if not return_dict:
outputs = (audio_embeds, audio_outputs[0]) + audio_outputs[2:]
return tuple(output for output in outputs if output is not None)
return ClapAudioModelOutput(
audio_embeds=audio_embeds,
last_hidden_state=audio_outputs.last_hidden_state,
attentions=audio_outputs.attentions,
hidden_states=audio_outputs.hidden_states,
)
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Audio/Text processor class for CLAP
"""
from ...processing_utils import ProcessorMixin
from ...tokenization_utils_base import BatchEncoding
class ClapProcessor(ProcessorMixin):
r"""
Constructs a CLAP processor which wraps a CLAP feature extractor and a RoBerta tokenizer into a single processor.
[`ClapProcessor`] offers all the functionalities of [`ClapFeatureExtractor`] and [`RobertaTokenizerFast`]. See the
[`~ClapProcessor.__call__`] and [`~ClapProcessor.decode`] for more information.
Args:
feature_extractor ([`ClapFeatureExtractor`]):
The audio processor is a required input.
tokenizer ([`RobertaTokenizerFast`]):
The tokenizer is a required input.
"""
feature_extractor_class = "ClapFeatureExtractor"
tokenizer_class = ("RobertaTokenizer", "RobertaTokenizerFast")
def __init__(self, feature_extractor, tokenizer):
super().__init__(feature_extractor, tokenizer)
def __call__(self, text=None, audios=None, return_tensors=None, **kwargs):
"""
Main method to prepare for the model one or several sequences(s) and audio(s). This method forwards the `text`
and `kwargs` arguments to RobertaTokenizerFast's [`~RobertaTokenizerFast.__call__`] if `text` is not `None` to
encode the text. To prepare the audio(s), this method forwards the `audios` and `kwrags` arguments to
ClapFeatureExtractor's [`~ClapFeatureExtractor.__call__`] if `audios` is not `None`. Please refer to the
doctsring of the above two methods for more information.
Args:
text (`str`, `List[str]`, `List[List[str]]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
audios (`np.ndarray`, `torch.Tensor`, `List[np.ndarray]`, `List[torch.Tensor]`):
The audio or batch of audios to be prepared. Each audio can be NumPy array or PyTorch tensor. In case
of a NumPy array/PyTorch tensor, each audio should be of shape (C, T), where C is a number of channels,
and T the sample length of the audio.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchEncoding`]: A [`BatchEncoding`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
`None`).
- **audio_features** -- Audio features to be fed to a model. Returned when `audios` is not `None`.
"""
sampling_rate = kwargs.pop("sampling_rate", None)
if text is None and audios is None:
raise ValueError("You have to specify either text or audios. Both cannot be none.")
if text is not None:
encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs)
if audios is not None:
audio_features = self.feature_extractor(
audios, sampling_rate=sampling_rate, return_tensors=return_tensors, **kwargs
)
if text is not None and audios is not None:
encoding["input_features"] = audio_features.input_features
return encoding
elif text is not None:
return encoding
else:
return BatchEncoding(data=dict(**audio_features), tensor_type=return_tensors)
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to RobertaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
refer to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to RobertaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer
to the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)
@property
def model_input_names(self):
tokenizer_input_names = self.tokenizer.model_input_names
feature_extractor_input_names = self.feature_extractor.model_input_names
return list(dict.fromkeys(tokenizer_input_names + feature_extractor_input_names))
...@@ -1464,6 +1464,58 @@ class ChineseCLIPVisionModel(metaclass=DummyObject): ...@@ -1464,6 +1464,58 @@ class ChineseCLIPVisionModel(metaclass=DummyObject):
requires_backends(self, ["torch"]) requires_backends(self, ["torch"])
CLAP_PRETRAINED_MODEL_ARCHIVE_LIST = None
class ClapAudioModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ClapAudioModelWithProjection(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ClapFeatureExtractor(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ClapModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ClapPreTrainedModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ClapTextModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ClapTextModelWithProjection(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
CLIP_PRETRAINED_MODEL_ARCHIVE_LIST = None CLIP_PRETRAINED_MODEL_ARCHIVE_LIST = None
......
# coding=utf-8
# Copyright 2023 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import itertools
import random
import unittest
import numpy as np
from transformers import ClapFeatureExtractor
from transformers.testing_utils import require_torch, require_torchaudio
from transformers.utils.import_utils import is_torch_available
from ...test_sequence_feature_extraction_common import SequenceFeatureExtractionTestMixin
if is_torch_available():
import torch
global_rng = random.Random()
# Copied from tests.models.whisper.test_feature_extraction_whisper.floats_list
def floats_list(shape, scale=1.0, rng=None, name=None):
"""Creates a random float32 tensor"""
if rng is None:
rng = global_rng
values = []
for batch_idx in range(shape[0]):
values.append([])
for _ in range(shape[1]):
values[-1].append(rng.random() * scale)
return values
@require_torch
@require_torchaudio
# Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTester with Whisper->Clap
class ClapFeatureExtractionTester(unittest.TestCase):
def __init__(
self,
parent,
batch_size=7,
min_seq_length=400,
max_seq_length=2000,
feature_size=10,
hop_length=160,
chunk_length=8,
padding_value=0.0,
sampling_rate=4_000,
return_attention_mask=False,
do_normalize=True,
):
self.parent = parent
self.batch_size = batch_size
self.min_seq_length = min_seq_length
self.max_seq_length = max_seq_length
self.seq_length_diff = (self.max_seq_length - self.min_seq_length) // (self.batch_size - 1)
self.padding_value = padding_value
self.sampling_rate = sampling_rate
self.return_attention_mask = return_attention_mask
self.do_normalize = do_normalize
self.feature_size = feature_size
self.chunk_length = chunk_length
self.hop_length = hop_length
def prepare_feat_extract_dict(self):
return {
"feature_size": self.feature_size,
"hop_length": self.hop_length,
"chunk_length": self.chunk_length,
"padding_value": self.padding_value,
"sampling_rate": self.sampling_rate,
"return_attention_mask": self.return_attention_mask,
"do_normalize": self.do_normalize,
}
def prepare_inputs_for_common(self, equal_length=False, numpify=False):
def _flatten(list_of_lists):
return list(itertools.chain(*list_of_lists))
if equal_length:
speech_inputs = [floats_list((self.max_seq_length, self.feature_size)) for _ in range(self.batch_size)]
else:
# make sure that inputs increase in size
speech_inputs = [
floats_list((x, self.feature_size))
for x in range(self.min_seq_length, self.max_seq_length, self.seq_length_diff)
]
if numpify:
speech_inputs = [np.asarray(x) for x in speech_inputs]
return speech_inputs
@require_torch
@require_torchaudio
# Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTest with Whisper->Clap
class ClapFeatureExtractionTest(SequenceFeatureExtractionTestMixin, unittest.TestCase):
feature_extraction_class = ClapFeatureExtractor
def setUp(self):
self.feat_extract_tester = ClapFeatureExtractionTester(self)
def test_call(self):
# Tests that all call wrap to encode_plus and batch_encode_plus
feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
# create three inputs of length 800, 1000, and 1200
speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
np_speech_inputs = [np.asarray(speech_input) for speech_input in speech_inputs]
# Test feature size
input_features = feature_extractor(np_speech_inputs, padding="max_length", return_tensors="np").input_features
self.assertTrue(input_features.ndim == 4)
# Test not batched input
encoded_sequences_1 = feature_extractor(speech_inputs[0], return_tensors="np").input_features
encoded_sequences_2 = feature_extractor(np_speech_inputs[0], return_tensors="np").input_features
self.assertTrue(np.allclose(encoded_sequences_1, encoded_sequences_2, atol=1e-3))
# Test batched
encoded_sequences_1 = feature_extractor(speech_inputs, return_tensors="np").input_features
encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
def test_double_precision_pad(self):
import torch
feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
np_speech_inputs = np.random.rand(100, 32).astype(np.float64)
py_speech_inputs = np_speech_inputs.tolist()
for inputs in [py_speech_inputs, np_speech_inputs]:
np_processed = feature_extractor.pad([{"input_features": inputs}], return_tensors="np")
self.assertTrue(np_processed.input_features.dtype == np.float32)
pt_processed = feature_extractor.pad([{"input_features": inputs}], return_tensors="pt")
self.assertTrue(pt_processed.input_features.dtype == torch.float32)
def _load_datasamples(self, num_samples):
from datasets import load_dataset
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
# automatic decoding with librispeech
speech_samples = ds.sort("id").select(range(num_samples))[:num_samples]["audio"]
return [x["array"] for x in speech_samples]
def integration_test_fusion(self):
# fmt: off
EXPECTED_INPUT_FEATURES = torch.tensor(
[
[
-30.2194, -22.4424, -18.6442, -17.2452, -22.7392, -32.2576, -36.1404,
-35.6120, -29.6229, -29.0454, -32.2157, -36.7664, -29.4436, -26.7825,
-31.1811, -38.3918, -38.8749, -43.4485, -47.6236, -38.7528, -31.8574,
-39.0591, -41.3190, -32.3319, -31.4699, -33.4502, -36.7412, -34.5265,
-35.1091, -40.4518, -42.7346, -44.5909, -44.9747, -45.8328, -47.0772,
-46.2723, -44.3613, -48.6253, -44.9551, -43.8700, -44.6104, -48.0146,
-42.7614, -47.3587, -47.4369, -45.5018, -47.0198, -42.8759, -47.5056,
-47.1567, -49.2621, -49.5643, -48.4330, -48.8495, -47.2512, -40.8439,
-48.1234, -49.1218, -48.7222, -50.2399, -46.8487, -41.9921, -50.4015,
-50.7827
],
[
-89.0141, -89.1411, -88.8096, -88.5480, -88.3481, -88.2038,
-88.1105, -88.0647, -88.0636, -88.1051, -88.1877, -88.1110,
-87.8613, -88.6679, -88.2685, -88.9684, -88.7977, -89.6264,
-89.9299, -90.3184, -91.1446, -91.9265, -92.7267, -93.6099,
-94.6395, -95.3243, -95.5923, -95.5773, -95.0889, -94.3354,
-93.5746, -92.9287, -92.4525, -91.9798, -91.8852, -91.7500,
-91.7259, -91.7561, -91.7959, -91.7070, -91.6914, -91.5019,
-91.0640, -90.0807, -88.7102, -87.0826, -85.5956, -84.4441,
-83.8461, -83.8605, -84.6702, -86.3900, -89.3073, -93.2926,
-96.3813, -97.3529, -100.0000, -99.6942, -92.2851, -87.9588,
-85.7214, -84.6807, -84.1940, -84.2021
],
[
-51.6882, -50.6852, -50.8198, -51.7428, -53.0325, -54.1619, -56.4903,
-59.0314, -60.7996, -60.5164, -59.9680, -60.5393, -62.5796, -65.4166,
-65.6149, -65.1409, -65.7226, -67.9057, -72.5089, -82.3530, -86.3189,
-83.4241, -79.1279, -79.3384, -82.7335, -79.8316, -80.2167, -74.3638,
-71.3930, -75.3849, -74.5381, -71.4504, -70.3791, -71.4547, -71.8820,
-67.3885, -69.5686, -71.9852, -71.0307, -73.0053, -80.8802, -72.9227,
-63.8526, -60.3260, -59.6012, -57.8316, -61.0603, -67.3403, -67.1709,
-60.4967, -60.5079, -68.3345, -67.5213, -70.6416, -79.6219, -78.2198,
-74.6851, -69.5718, -69.4968, -70.6882, -66.8175, -73.8558, -74.3855,
-72.9405
]
]
)
# fmt: on
MEL_BIN = [963, 963, 161]
input_speech = self._load_datasamples(1)
feaure_extractor = ClapFeatureExtractor()
for padding, EXPECTED_VALUES, idx_in_mel in zip(
["repeat", "repeatpad", None], EXPECTED_INPUT_FEATURES, MEL_BIN
):
input_features = feaure_extractor(input_speech, return_tensors="pt", padding=padding).input_features
self.assertTrue(torch.allclose(input_features[0, idx_in_mel], EXPECTED_VALUES, atol=1e-4))
def integration_test_rand_trunc(self):
# TODO in this case we should set the seed and use a longer audio to properly see the random truncation
# fmt: off
EXPECTED_INPUT_FEATURES = torch.tensor(
[
[
-42.3330, -36.2735, -35.9231, -43.5947, -48.4525, -46.5227, -42.6477,
-47.2740, -51.4336, -50.0846, -51.8711, -50.4232, -47.4736, -54.2275,
-53.3947, -55.4904, -54.8750, -54.5510, -55.4156, -57.4395, -51.7385,
-55.9118, -57.7800, -63.2064, -67.0651, -61.4379, -56.4268, -54.8667,
-52.3487, -56.4418, -57.1842, -55.1005, -55.6366, -59.4395, -56.8604,
-56.4949, -61.6573, -61.0826, -60.3250, -63.7876, -67.4882, -60.2323,
-54.6886, -50.5369, -47.7656, -45.8909, -49.1273, -57.4141, -58.3201,
-51.9862, -51.4897, -59.2561, -60.4730, -61.2203, -69.3174, -69.7464,
-65.5861, -58.9921, -59.5610, -61.0584, -58.1149, -64.4045, -66.2622,
-64.4610
],
[
-41.2298, -38.4211, -39.8834, -45.9950, -47.3839, -43.9849, -46.0371,
-52.5490, -56.6912, -51.8794, -50.1284, -49.7506, -53.9422, -63.2854,
-56.5754, -55.0469, -55.3181, -55.8115, -56.0058, -57.9215, -58.7597,
-59.1994, -59.2141, -64.4198, -73.5138, -64.4647, -59.3351, -54.5626,
-54.7508, -65.0230, -60.0270, -54.7644, -56.0108, -60.1531, -57.6879,
-56.3766, -63.3395, -65.3032, -61.5202, -63.0677, -68.4217, -60.6868,
-54.4619, -50.8533, -47.7200, -45.9197, -49.0961, -57.7621, -59.0750,
-51.9122, -51.4332, -59.4132, -60.3415, -61.6558, -70.7049, -69.7905,
-66.9104, -59.0324, -59.6138, -61.2023, -58.2169, -65.3837, -66.4425,
-64.4142
],
[
-51.6882, -50.6852, -50.8198, -51.7428, -53.0325, -54.1619, -56.4903,
-59.0314, -60.7996, -60.5164, -59.9680, -60.5393, -62.5796, -65.4166,
-65.6149, -65.1409, -65.7226, -67.9057, -72.5089, -82.3530, -86.3189,
-83.4241, -79.1279, -79.3384, -82.7335, -79.8316, -80.2167, -74.3638,
-71.3930, -75.3849, -74.5381, -71.4504, -70.3791, -71.4547, -71.8820,
-67.3885, -69.5686, -71.9852, -71.0307, -73.0053, -80.8802, -72.9227,
-63.8526, -60.3260, -59.6012, -57.8316, -61.0603, -67.3403, -67.1709,
-60.4967, -60.5079, -68.3345, -67.5213, -70.6416, -79.6219, -78.2198,
-74.6851, -69.5718, -69.4968, -70.6882, -66.8175, -73.8558, -74.3855,
-72.9405
]
]
)
# fmt: on
input_speech = self._load_datasamples(1)
feaure_extractor = ClapFeatureExtractor()
for padding, EXPECTED_VALUES in zip(["repeat", "repeatpad", None], EXPECTED_INPUT_FEATURES):
input_features = feaure_extractor(
input_speech, return_tensors="pt", truncation="rand_trunc", padding=padding
).input_features
self.assertTrue(torch.allclose(input_features[0, 0, :30], EXPECTED_VALUES, atol=1e-4))
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Testing suite for the PyTorch CLAP model. """
import inspect
import os
import tempfile
import unittest
import numpy as np
from datasets import load_dataset
from transformers import ClapAudioConfig, ClapConfig, ClapProcessor, ClapTextConfig
from transformers.testing_utils import require_torch, slow, torch_device
from transformers.utils import is_torch_available
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import (
ModelTesterMixin,
_config_zero_init,
floats_tensor,
ids_tensor,
random_attention_mask,
)
if is_torch_available():
import torch
from torch import nn
from transformers import (
ClapAudioModel,
ClapAudioModelWithProjection,
ClapModel,
ClapTextModel,
ClapTextModelWithProjection,
)
from transformers.models.clap.modeling_clap import CLAP_PRETRAINED_MODEL_ARCHIVE_LIST
class ClapAudioModelTester:
def __init__(
self,
parent,
batch_size=12,
image_size=60,
num_mel_bins=16,
window_size=4,
spec_size=64,
patch_size=2,
patch_stride=2,
seq_length=16,
freq_ratio=2,
num_channels=3,
is_training=True,
hidden_size=256,
patch_embeds_hidden_size=32,
projection_dim=32,
num_hidden_layers=4,
num_heads=[2, 2, 2, 2],
intermediate_size=37,
dropout=0.1,
attention_dropout=0.1,
initializer_range=0.02,
scope=None,
):
self.parent = parent
self.batch_size = batch_size
self.image_size = image_size
self.num_mel_bins = num_mel_bins
self.window_size = window_size
self.patch_size = patch_size
self.num_channels = num_channels
self.is_training = is_training
self.hidden_size = hidden_size
self.projection_dim = projection_dim
self.num_hidden_layers = num_hidden_layers
self.num_heads = num_heads
self.num_attention_heads = num_heads[0]
self.seq_length = seq_length
self.spec_size = spec_size
self.freq_ratio = freq_ratio
self.patch_stride = patch_stride
self.patch_embeds_hidden_size = patch_embeds_hidden_size
self.intermediate_size = intermediate_size
self.dropout = dropout
self.attention_dropout = attention_dropout
self.initializer_range = initializer_range
self.scope = scope
def prepare_config_and_inputs(self):
input_features = floats_tensor([self.batch_size, 1, self.hidden_size, self.num_mel_bins])
config = self.get_config()
return config, input_features
def get_config(self):
return ClapAudioConfig(
image_size=self.image_size,
patch_size=self.patch_size,
num_mel_bins=self.num_mel_bins,
window_size=self.window_size,
num_channels=self.num_channels,
hidden_size=self.hidden_size,
patch_stride=self.patch_stride,
projection_dim=self.projection_dim,
num_hidden_layers=self.num_hidden_layers,
num_attention_heads=self.num_heads,
intermediate_size=self.intermediate_size,
dropout=self.dropout,
attention_dropout=self.attention_dropout,
initializer_range=self.initializer_range,
spec_size=self.spec_size,
freq_ratio=self.freq_ratio,
patch_embeds_hidden_size=self.patch_embeds_hidden_size,
)
def create_and_check_model(self, config, input_features):
model = ClapAudioModel(config=config)
model.to(torch_device)
model.eval()
with torch.no_grad():
result = model(input_features)
self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size))
def create_and_check_model_with_projection(self, config, input_features):
model = ClapAudioModelWithProjection(config=config)
model.to(torch_device)
model.eval()
with torch.no_grad():
result = model(input_features)
self.parent.assertEqual(result.audio_embeds.shape, (self.batch_size, self.projection_dim))
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
config, input_features = config_and_inputs
inputs_dict = {"input_features": input_features}
return config, inputs_dict
@require_torch
class ClapAudioModelTest(ModelTesterMixin, unittest.TestCase):
"""
Here we also overwrite some of the tests of test_modeling_common.py, as CLAP does not use input_ids, inputs_embeds,
attention_mask and seq_length.
"""
all_model_classes = (ClapAudioModel, ClapAudioModelWithProjection) if is_torch_available() else ()
fx_compatible = False
test_pruning = False
test_resize_embeddings = False
test_head_masking = False
def setUp(self):
self.model_tester = ClapAudioModelTester(self)
self.config_tester = ConfigTester(self, config_class=ClapAudioConfig, has_text_modality=False, hidden_size=37)
def test_config(self):
self.config_tester.run_common_tests()
@unittest.skip(reason="ClapAudioModel does not use inputs_embeds")
def test_inputs_embeds(self):
pass
def test_model_common_attributes(self):
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
self.assertIsInstance(model.get_input_embeddings(), (nn.Module))
x = model.get_output_embeddings()
self.assertTrue(x is None or isinstance(x, nn.Linear))
def test_hidden_states_output(self):
def check_hidden_states_output(inputs_dict, config, model_class):
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
hidden_states = outputs.hidden_states
expected_num_layers = getattr(
self.model_tester, "expected_num_hidden_layers", self.model_tester.num_hidden_layers + 1
)
self.assertEqual(len(hidden_states), expected_num_layers)
self.assertListEqual(
list(hidden_states[0].shape[-2:]),
[self.model_tester.patch_embeds_hidden_size, self.model_tester.patch_embeds_hidden_size],
)
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
inputs_dict["output_hidden_states"] = True
check_hidden_states_output(inputs_dict, config, model_class)
# check that output_hidden_states also work using config
del inputs_dict["output_hidden_states"]
config.output_hidden_states = True
check_hidden_states_output(inputs_dict, config, model_class)
@unittest.skip(reason="ClapAudioModel does not output any loss term in the forward pass")
def test_retain_grad_hidden_states_attentions(self):
pass
def test_forward_signature(self):
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
signature = inspect.signature(model.forward)
# signature.parameters is an OrderedDict => so arg_names order is deterministic
arg_names = [*signature.parameters.keys()]
expected_arg_names = ["input_features"]
self.assertListEqual(arg_names[:1], expected_arg_names)
def test_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model(*config_and_inputs)
def test_model_with_projection(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model_with_projection(*config_and_inputs)
@unittest.skip(reason="ClapAudioModel does not output any loss term in the forward pass")
def test_training(self):
pass
@unittest.skip(reason="ClapAudioModel does not output any loss term in the forward pass")
def test_training_gradient_checkpointing(self):
pass
@unittest.skip(reason="ClapAudioModel has no base class and is not available in MODEL_MAPPING")
def test_save_load_fast_init_from_base(self):
pass
@unittest.skip(reason="ClapAudioModel has no base class and is not available in MODEL_MAPPING")
def test_save_load_fast_init_to_base(self):
pass
@slow
def test_model_from_pretrained(self):
for model_name in CLAP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = ClapAudioModel.from_pretrained(model_name)
self.assertIsNotNone(model)
@slow
def test_model_with_projection_from_pretrained(self):
for model_name in CLAP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = ClapAudioModelWithProjection.from_pretrained(model_name)
self.assertIsNotNone(model)
self.assertTrue(hasattr(model, "visual_projection"))
class ClapTextModelTester:
def __init__(
self,
parent,
batch_size=12,
seq_length=7,
is_training=True,
use_input_mask=True,
use_labels=True,
vocab_size=99,
hidden_size=32,
projection_dim=32,
num_hidden_layers=5,
num_attention_heads=4,
intermediate_size=37,
dropout=0.1,
attention_dropout=0.1,
max_position_embeddings=512,
initializer_range=0.02,
scope=None,
projection_hidden_act="relu",
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.use_input_mask = use_input_mask
self.use_labels = use_labels
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.projection_dim = projection_dim
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.dropout = dropout
self.attention_dropout = attention_dropout
self.max_position_embeddings = max_position_embeddings
self.initializer_range = initializer_range
self.scope = scope
self.projection_hidden_act = projection_hidden_act
def prepare_config_and_inputs(self):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
input_mask = None
if self.use_input_mask:
input_mask = random_attention_mask([self.batch_size, self.seq_length])
if input_mask is not None:
batch_size, seq_length = input_mask.shape
rnd_start_indices = np.random.randint(1, seq_length - 1, size=(batch_size,))
for batch_idx, start_index in enumerate(rnd_start_indices):
input_mask[batch_idx, :start_index] = 1
input_mask[batch_idx, start_index:] = 0
config = self.get_config()
return config, input_ids, input_mask
def get_config(self):
return ClapTextConfig(
vocab_size=self.vocab_size,
hidden_size=self.hidden_size,
projection_dim=self.projection_dim,
num_hidden_layers=self.num_hidden_layers,
num_attention_heads=self.num_attention_heads,
intermediate_size=self.intermediate_size,
dropout=self.dropout,
attention_dropout=self.attention_dropout,
max_position_embeddings=self.max_position_embeddings,
initializer_range=self.initializer_range,
projection_hidden_act=self.projection_hidden_act,
)
def create_and_check_model(self, config, input_ids, input_mask):
model = ClapTextModel(config=config)
model.to(torch_device)
model.eval()
with torch.no_grad():
result = model(input_ids, attention_mask=input_mask)
result = model(input_ids)
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size))
def create_and_check_model_with_projection(self, config, input_ids, input_mask):
model = ClapTextModelWithProjection(config=config)
model.to(torch_device)
model.eval()
with torch.no_grad():
result = model(input_ids, attention_mask=input_mask)
result = model(input_ids)
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
self.parent.assertEqual(result.text_embeds.shape, (self.batch_size, self.projection_dim))
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
config, input_ids, input_mask = config_and_inputs
inputs_dict = {"input_ids": input_ids, "attention_mask": input_mask}
return config, inputs_dict
@require_torch
class ClapTextModelTest(ModelTesterMixin, unittest.TestCase):
all_model_classes = (ClapTextModel, ClapTextModelWithProjection) if is_torch_available() else ()
fx_compatible = False
test_pruning = False
test_head_masking = False
def setUp(self):
self.model_tester = ClapTextModelTester(self)
self.config_tester = ConfigTester(self, config_class=ClapTextConfig, hidden_size=37)
def test_config(self):
self.config_tester.run_common_tests()
def test_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model(*config_and_inputs)
def test_model_with_projection(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model_with_projection(*config_and_inputs)
@unittest.skip(reason="ClapTextModel does not output any loss term in the forward pass")
def test_training(self):
pass
@unittest.skip(reason="ClapTextModel does not output any loss term in the forward pass")
def test_training_gradient_checkpointing(self):
pass
@unittest.skip(reason="ClapTextModel does not use inputs_embeds")
def test_inputs_embeds(self):
pass
@unittest.skip(reason="ClapTextModel has no base class and is not available in MODEL_MAPPING")
def test_save_load_fast_init_from_base(self):
pass
@unittest.skip(reason="ClapTextModel has no base class and is not available in MODEL_MAPPING")
def test_save_load_fast_init_to_base(self):
pass
@slow
def test_model_from_pretrained(self):
for model_name in CLAP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = ClapTextModel.from_pretrained(model_name)
self.assertIsNotNone(model)
@slow
def test_model_with_projection_from_pretrained(self):
for model_name in CLAP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = ClapTextModelWithProjection.from_pretrained(model_name)
self.assertIsNotNone(model)
self.assertTrue(hasattr(model, "text_projection"))
class ClapModelTester:
def __init__(self, parent, text_kwargs=None, audio_kwargs=None, is_training=True):
if text_kwargs is None:
text_kwargs = {}
if audio_kwargs is None:
audio_kwargs = {}
self.parent = parent
self.text_model_tester = ClapTextModelTester(parent, **text_kwargs)
self.audio_model_tester = ClapAudioModelTester(parent, **audio_kwargs)
self.is_training = is_training
def prepare_config_and_inputs(self):
_, input_ids, attention_mask = self.text_model_tester.prepare_config_and_inputs()
_, input_features = self.audio_model_tester.prepare_config_and_inputs()
config = self.get_config()
return config, input_ids, attention_mask, input_features
def get_config(self):
return ClapConfig.from_text_audio_configs(
self.text_model_tester.get_config(), self.audio_model_tester.get_config(), projection_dim=64
)
def create_and_check_model(self, config, input_ids, attention_mask, input_features):
model = ClapModel(config).to(torch_device).eval()
with torch.no_grad():
result = model(input_ids, input_features, attention_mask)
self.parent.assertEqual(
result.logits_per_audio.shape, (self.audio_model_tester.batch_size, self.text_model_tester.batch_size)
)
self.parent.assertEqual(
result.logits_per_text.shape, (self.text_model_tester.batch_size, self.audio_model_tester.batch_size)
)
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
config, input_ids, attention_mask, input_features = config_and_inputs
inputs_dict = {
"input_ids": input_ids,
"attention_mask": attention_mask,
"input_features": input_features,
"return_loss": True,
}
return config, inputs_dict
@require_torch
class ClapModelTest(ModelTesterMixin, unittest.TestCase):
all_model_classes = (ClapModel,) if is_torch_available() else ()
fx_compatible = False
test_head_masking = False
test_pruning = False
test_resize_embeddings = False
test_attention_outputs = False
def setUp(self):
self.model_tester = ClapModelTester(self)
def test_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model(*config_and_inputs)
@unittest.skip(reason="Hidden_states is tested in individual model tests")
def test_hidden_states_output(self):
pass
@unittest.skip(reason="Inputs_embeds is tested in individual model tests")
def test_inputs_embeds(self):
pass
@unittest.skip(reason="Retain_grad is tested in individual model tests")
def test_retain_grad_hidden_states_attentions(self):
pass
@unittest.skip(reason="ClapModel does not have input/output embeddings")
def test_model_common_attributes(self):
pass
# override as the `logit_scale` parameter initilization is different for CLAP
def test_initialization(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
configs_no_init = _config_zero_init(config)
for model_class in self.all_model_classes:
model = model_class(config=configs_no_init)
for name, param in model.named_parameters():
if param.requires_grad:
# check if `logit_scale` is initilized as per the original implementation
if name == "logit_scale":
self.assertAlmostEqual(
param.data.item(),
np.log(1 / 0.07),
delta=1e-3,
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
)
else:
self.assertIn(
((param.data.mean() * 1e9).round() / 1e9).item(),
[0.0, 1.0],
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
)
def _create_and_check_torchscript(self, config, inputs_dict):
if not self.test_torchscript:
return
configs_no_init = _config_zero_init(config) # To be sure we have no Nan
configs_no_init.torchscript = True
configs_no_init.return_dict = False
for model_class in self.all_model_classes:
model = model_class(config=configs_no_init)
model.to(torch_device)
model.eval()
try:
input_ids = inputs_dict["input_ids"]
input_features = inputs_dict["input_features"] # CLAP needs input_features
traced_model = torch.jit.trace(model, (input_ids, input_features))
except RuntimeError:
self.fail("Couldn't trace module.")
with tempfile.TemporaryDirectory() as tmp_dir_name:
pt_file_name = os.path.join(tmp_dir_name, "traced_model.pt")
try:
torch.jit.save(traced_model, pt_file_name)
except Exception:
self.fail("Couldn't save module.")
try:
loaded_model = torch.jit.load(pt_file_name)
except Exception:
self.fail("Couldn't load module.")
model.to(torch_device)
model.eval()
loaded_model.to(torch_device)
loaded_model.eval()
model_state_dict = model.state_dict()
loaded_model_state_dict = loaded_model.state_dict()
self.assertEqual(set(model_state_dict.keys()), set(loaded_model_state_dict.keys()))
models_equal = True
for layer_name, p1 in model_state_dict.items():
p2 = loaded_model_state_dict[layer_name]
if p1.data.ne(p2.data).sum() > 0:
models_equal = False
self.assertTrue(models_equal)
def test_load_audio_text_config(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
# Save ClapConfig and check if we can load ClapAudioConfig from it
with tempfile.TemporaryDirectory() as tmp_dir_name:
config.save_pretrained(tmp_dir_name)
audio_config = ClapAudioConfig.from_pretrained(tmp_dir_name)
self.assertDictEqual(config.audio_config.to_dict(), audio_config.to_dict())
# Save ClapConfig and check if we can load ClapTextConfig from it
with tempfile.TemporaryDirectory() as tmp_dir_name:
config.save_pretrained(tmp_dir_name)
text_config = ClapTextConfig.from_pretrained(tmp_dir_name)
self.assertDictEqual(config.text_config.to_dict(), text_config.to_dict())
@slow
def test_model_from_pretrained(self):
for model_name in CLAP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = ClapModel.from_pretrained(model_name)
self.assertIsNotNone(model)
@slow
@require_torch
class ClapModelIntegrationTest(unittest.TestCase):
paddings = ["repeatpad", "repeat", "pad"]
def test_integration_unfused(self):
EXPECTED_MEANS_UNFUSED = {
"repeatpad": 0.0024,
"pad": 0.0020,
"repeat": 0.0023,
}
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio_sample = librispeech_dummy[-1]
model_id = "laion/clap-htsat-unfused"
model = ClapModel.from_pretrained(model_id).to(torch_device)
processor = ClapProcessor.from_pretrained(model_id)
for padding in self.paddings:
inputs = processor(audios=audio_sample["audio"]["array"], return_tensors="pt", padding=padding).to(
torch_device
)
audio_embed = model.get_audio_features(**inputs)
expected_mean = EXPECTED_MEANS_UNFUSED[padding]
self.assertTrue(
torch.allclose(audio_embed.cpu().mean(), torch.tensor([expected_mean]), atol=1e-3, rtol=1e-3)
)
def test_integration_fused(self):
EXPECTED_MEANS_FUSED = {
"repeatpad": 0.00069,
"repeat": 0.00196,
"pad": -0.000379,
}
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio_sample = librispeech_dummy[-1]
model_id = "laion/clap-htsat-fused"
model = ClapModel.from_pretrained(model_id).to(torch_device)
processor = ClapProcessor.from_pretrained(model_id)
for padding in self.paddings:
inputs = processor(
audios=audio_sample["audio"]["array"], return_tensors="pt", padding=padding, truncation="fusion"
).to(torch_device)
audio_embed = model.get_audio_features(**inputs)
expected_mean = EXPECTED_MEANS_FUSED[padding]
self.assertTrue(
torch.allclose(audio_embed.cpu().mean(), torch.tensor([expected_mean]), atol=1e-3, rtol=1e-3)
)
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import shutil
import tempfile
import unittest
from transformers import ClapFeatureExtractor, ClapProcessor, RobertaTokenizer, RobertaTokenizerFast
from transformers.testing_utils import require_sentencepiece, require_torchaudio
from .test_feature_extraction_clap import floats_list
@require_torchaudio
@require_sentencepiece
class ClapProcessorTest(unittest.TestCase):
def setUp(self):
self.checkpoint = "laion/clap-htsat-unfused"
self.tmpdirname = tempfile.mkdtemp()
def get_tokenizer(self, **kwargs):
return RobertaTokenizer.from_pretrained(self.checkpoint, **kwargs)
def get_feature_extractor(self, **kwargs):
return ClapFeatureExtractor.from_pretrained(self.checkpoint, **kwargs)
def tearDown(self):
shutil.rmtree(self.tmpdirname)
def test_save_load_pretrained_default(self):
tokenizer = self.get_tokenizer()
feature_extractor = self.get_feature_extractor()
processor = ClapProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)
processor.save_pretrained(self.tmpdirname)
processor = ClapProcessor.from_pretrained(self.tmpdirname)
self.assertEqual(processor.tokenizer.get_vocab(), tokenizer.get_vocab())
self.assertIsInstance(processor.tokenizer, RobertaTokenizerFast)
self.assertEqual(processor.feature_extractor.to_json_string(), feature_extractor.to_json_string())
self.assertIsInstance(processor.feature_extractor, ClapFeatureExtractor)
def test_save_load_pretrained_additional_features(self):
processor = ClapProcessor(tokenizer=self.get_tokenizer(), feature_extractor=self.get_feature_extractor())
processor.save_pretrained(self.tmpdirname)
tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)")
feature_extractor_add_kwargs = self.get_feature_extractor(do_normalize=False, padding_value=1.0)
processor = ClapProcessor.from_pretrained(
self.tmpdirname, bos_token="(BOS)", eos_token="(EOS)", do_normalize=False, padding_value=1.0
)
self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())
self.assertIsInstance(processor.tokenizer, RobertaTokenizerFast)
self.assertEqual(processor.feature_extractor.to_json_string(), feature_extractor_add_kwargs.to_json_string())
self.assertIsInstance(processor.feature_extractor, ClapFeatureExtractor)
def test_feature_extractor(self):
feature_extractor = self.get_feature_extractor()
tokenizer = self.get_tokenizer()
processor = ClapProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)
raw_speech = floats_list((3, 1000))
input_feat_extract = feature_extractor(raw_speech, return_tensors="np")
input_processor = processor(audios=raw_speech, return_tensors="np")
for key in input_feat_extract.keys():
self.assertAlmostEqual(input_feat_extract[key].sum(), input_processor[key].sum(), delta=1e-2)
def test_tokenizer(self):
feature_extractor = self.get_feature_extractor()
tokenizer = self.get_tokenizer()
processor = ClapProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)
input_str = "This is a test string"
encoded_processor = processor(text=input_str)
encoded_tok = tokenizer(input_str)
for key in encoded_tok.keys():
self.assertListEqual(encoded_tok[key], encoded_processor[key])
def test_tokenizer_decode(self):
feature_extractor = self.get_feature_extractor()
tokenizer = self.get_tokenizer()
processor = ClapProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)
predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9]]
decoded_processor = processor.batch_decode(predicted_ids)
decoded_tok = tokenizer.batch_decode(predicted_ids)
self.assertListEqual(decoded_tok, decoded_processor)
def test_model_input_names(self):
feature_extractor = self.get_feature_extractor()
tokenizer = self.get_tokenizer()
processor = ClapProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)
self.assertListEqual(
processor.model_input_names[2:],
feature_extractor.model_input_names,
msg="`processor` and `feature_extractor` model input names do not match",
)
...@@ -172,6 +172,10 @@ TEST_FILES_WITH_NO_COMMON_TESTS = [ ...@@ -172,6 +172,10 @@ TEST_FILES_WITH_NO_COMMON_TESTS = [
# should **not** be the rule. # should **not** be the rule.
IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
# models to ignore for model xxx mapping # models to ignore for model xxx mapping
"ClapTextModel",
"ClapTextModelWithProjection",
"ClapAudioModel",
"ClapAudioModelWithProjection",
"Blip2ForConditionalGeneration", "Blip2ForConditionalGeneration",
"Blip2QFormerModel", "Blip2QFormerModel",
"Blip2VisionModel", "Blip2VisionModel",
......
...@@ -40,6 +40,8 @@ src/transformers/models/bloom/configuration_bloom.py ...@@ -40,6 +40,8 @@ src/transformers/models/bloom/configuration_bloom.py
src/transformers/models/camembert/configuration_camembert.py src/transformers/models/camembert/configuration_camembert.py
src/transformers/models/canine/configuration_canine.py src/transformers/models/canine/configuration_canine.py
src/transformers/models/canine/modeling_canine.py src/transformers/models/canine/modeling_canine.py
src/transformers/models/clap/configuration_clap.py
src/transformers/models/clap/modeling_clap.py
src/transformers/models/clip/configuration_clip.py src/transformers/models/clip/configuration_clip.py
src/transformers/models/clipseg/modeling_clipseg.py src/transformers/models/clipseg/modeling_clipseg.py
src/transformers/models/codegen/configuration_codegen.py src/transformers/models/codegen/configuration_codegen.py
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment