Unverified Commit cb45f71c authored by Yoach Lacombe's avatar Yoach Lacombe Committed by GitHub
Browse files

Add Seamless M4T model (#25693)



* first raw commit

* still POC

* tentative convert script

* almost working speech encoder conversion scripts

* intermediate code for encoder/decoders

* add modeling code

* first version of speech encoder

* make style

* add new adapter layer architecture

* add adapter block

* add first tentative config

* add working speech encoder conversion

* base model convert works now

* make style

* remove unnecessary classes

* remove unecessary functions

* add modeling code speech encoder

* rework logics

* forward pass of sub components work

* add modeling codes

* some config modifs and modeling code modifs

* save WIP

* new edits

* same output speech encoder

* correct attention mask

* correct attention mask

* fix generation

* new generation logics

* erase comments

* make style

* fix typo

* add some descriptions

* new state

* clean imports

* add tests

* make style

* make beam search and num_return_sequences>1 works

* correct edge case issue

* correct SeamlessM4TConformerSamePadLayer copied from

* replace ACT2FN relu by nn.relu

* remove unecessary return variable

* move back a class

* change name conformer_attention_mask ->conv_attention_mask

* better nit code

* add some Copied from statements

* small nits

* small nit in dict.get

* rename t2u model -> conditionalgeneration

* ongoing refactoring of structure

* update models architecture

* remove SeamlessM4TMultiModal classes

* add tests

* adapt tests

* some non-working code for vocoder

* add seamlessM4T vocoder

* remove buggy line

* fix some hifigan related bugs

* remove hifigan specifc config

* change

* add WIP tokenization

* add seamlessM4T working tokenzier

* update tokenization

* add tentative feature extractor

* Update converting script

* update working FE

* refactor input_values -> input_features

* update FE

* changes in generation, tokenizer and modeling

* make style and add t2u_decoder_input_ids

* add intermediate outputs for ToSpeech models

* add vocoder to speech models

* update valueerror

* update FE with languages

* add vocoder convert

* update config docstrings and names

* update generation code and configuration

* remove todos and update config.pad_token_id to generation_config.pad_token_id

* move block vocoder

* remove unecessary code and uniformize tospeech code

* add feature extractor import

* make style and fix some copies from

* correct consistency + make fix-copies

* add processor code

* remove comments

* add fast tokenizer support

* correct pad_token_id in M4TModel

* correct config

* update tests and codes  + make style

* make some suggested correstion - correct comments and change naming

* rename some attributes

* rename some attributes

* remove unecessary sequential

* remove option to use dur predictor

* nit

* refactor hifigan

* replace normalize_mean and normalize_var with do_normalize + save lang ids to generation config

* add tests

* change tgt_lang logic

* update generation ToSpeech

* add support import SeamlessM4TProcessor

* fix generate

* make tests

* update integration tests, add option to only return text and update tokenizer fast

* fix wrong function call

* update import and convert script

* update integration tests + update repo id

* correct paths and add first test

* update how new attention masks are computed

* update tests

* take first care of batching in vocoder code

* add batching with the vocoder

* add waveform lengths to model outputs

* make style

* add generate kwargs + forward kwargs of M4TModel

* add docstrings forward methods

* reformate docstrings

* add docstrings t2u model

* add another round of modeling docstrings + reformate speaker_id -> spkr_id

* make style

* fix check_repo

* make style

* add seamlessm4t to toctree

* correct check_config_attributes

* write config docstrings + some modifs

* make style

* add docstrings tokenizer

* add docstrings to processor, fe and tokenizers

* make style

* write first version of model docs

* fix FE + correct FE test

* fix tokenizer + add correct integration tests

* fix most tokenization tests

* make style

* correct most processor test

* add generation tests and fix num_return_sequences > 1

* correct integration tests -still one left

* make style

* correct position embedding

* change numbeams to 1

* refactor some modeling code and correct one test

* make style

* correct typo

* refactor intermediate fnn

* refactor feedforward conformer

* make style

* remove comments

* make style

* fix tokenizer tests

* make style

* correct processor tests

* make style

* correct S2TT integration

* Apply suggestions from Sanchit code review
Co-authored-by: default avatarSanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* correct typo

* replace torch.nn->nn + make style

* change Output naming (waveforms -> waveform) and ordering

* nit renaming and formating

* remove return None when not necessary

* refactor SeamlessM4TConformerFeedForward

* nit typo

* remove almost copied from comments

* add a copied from comment and remove an unecessary dropout

* remove inputs_embeds from speechencoder

* remove backward compatibiliy function

* reformate class docstrings for a few components

* remove unecessary methods

* split over 2 lines smthg hard to read

* make style

* replace two steps offset by one step as suggested

* nice typo

* move warnings

* remove useless lines from processor

* make generation non-standard test more robusts

* remove torch.inference_mode from tests

* split integration tests

* enrich md

* rename control_symbol_vocoder_offset->vocoder_offset

* clean convert file

* remove tgt_lang and src_lang from FE

* change generate docstring of ToText models

* update generate docstring of tospeech models

* unify how to deal withtext_decoder_input_ids

* add default spkr_id

* unify tgt_lang for t2u_model

* simplify tgt_lang verification

* remove a todo

* change config docstring

* make style

* simplify t2u_tgt_lang_id

* make style

* enrich/correct comments

* enrich .md

* correct typo in docstrings

* add torchaudio dependency

* update tokenizer

* make style and fix copies

* modify SeamlessM4TConverter with new tokenizer behaviour

* make style

* correct small typo docs

* fix import

* update docs and add requirement to tests

* add convert_fairseq2_to_hf in utils/not_doctested.txt

* update FE

* fix imports and make style

* remove torchaudio in FE test

* add seamless_m4t.md to utils/not_doctested.txt

* nits and change the way docstring dataset is loaded

* move checkpoints from ylacombe/ to facebook/ orga

* refactor warning/error to be in the 119 line width limit

* round overly precised floats

* add stereo audio behaviour

* refactor .md and make style

* enrich docs with more precised architecture description

* readd undocumented models

* make fix-copies

* apply some suggestions

* Apply suggestions from code review
Co-authored-by: default avatarSanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: default avatarArthur <48595927+ArthurZucker@users.noreply.github.com>

* correct bug from previous commit

* refactor a parameter allowing to clean the code + some small nits

* clean tokenizer

* make style and fix

* make style

* clean tokenizers arguments

* add precisions for some tests

* move docs from not_tested to slow

* modify tokenizer according to last comments

* add copied from statements in tests

* correct convert script

* correct parameter docstring style

* correct tokenization

* correct multi gpus

* make style

* clean modeling code

* make style

* add copied from statements

* add copied statements

* add support with ASR pipeline

* remove file added inadvertently

* fix docstrings seamlessM4TModel

* add seamlessM4TConfig to OBJECTS_TO_IGNORE due of unconventional markdown

* add seamlessm4t to assisted generation ignored models

---------
Co-authored-by: default avatarSanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: default avatarArthur <48595927+ArthurZucker@users.noreply.github.com>
parent 50d0cf4f
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import (
OptionalDependencyNotAvailable,
_LazyModule,
is_sentencepiece_available,
is_tokenizers_available,
is_torch_available,
)
_import_structure = {
"configuration_seamless_m4t": ["SEAMLESS_M4T_PRETRAINED_CONFIG_ARCHIVE_MAP", "SeamlessM4TConfig"],
"feature_extraction_seamless_m4t": ["SeamlessM4TFeatureExtractor"],
"processing_seamless_m4t": ["SeamlessM4TProcessor"],
}
try:
if not is_sentencepiece_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["tokenization_seamless_m4t"] = ["SeamlessM4TTokenizer"]
try:
if not is_tokenizers_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["tokenization_seamless_m4t_fast"] = ["SeamlessM4TTokenizerFast"]
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_seamless_m4t"] = [
"SEAMLESS_M4T_PRETRAINED_MODEL_ARCHIVE_LIST",
"SeamlessM4TForTextToSpeech",
"SeamlessM4TForSpeechToSpeech",
"SeamlessM4TForTextToText",
"SeamlessM4TForSpeechToText",
"SeamlessM4TModel",
"SeamlessM4TPreTrainedModel",
"SeamlessM4TCodeHifiGan",
"SeamlessM4THifiGan",
"SeamlessM4TTextToUnitForConditionalGeneration",
"SeamlessM4TTextToUnitModel",
]
if TYPE_CHECKING:
from .configuration_seamless_m4t import SEAMLESS_M4T_PRETRAINED_CONFIG_ARCHIVE_MAP, SeamlessM4TConfig
from .feature_extraction_seamless_m4t import SeamlessM4TFeatureExtractor
from .processing_seamless_m4t import SeamlessM4TProcessor
try:
if not is_sentencepiece_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .tokenization_seamless_m4t import SeamlessM4TTokenizer
try:
if not is_tokenizers_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .tokenization_seamless_m4t_fast import SeamlessM4TTokenizerFast
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_seamless_m4t import (
SEAMLESS_M4T_PRETRAINED_MODEL_ARCHIVE_LIST,
SeamlessM4TCodeHifiGan,
SeamlessM4TForSpeechToSpeech,
SeamlessM4TForSpeechToText,
SeamlessM4TForTextToSpeech,
SeamlessM4TForTextToText,
SeamlessM4THifiGan,
SeamlessM4TModel,
SeamlessM4TPreTrainedModel,
SeamlessM4TTextToUnitForConditionalGeneration,
SeamlessM4TTextToUnitModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" SeamlessM4T model configuration"""
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
SEAMLESS_M4T_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"facebook/hf-seamless-m4t-medium": "https://huggingface.co/facebook/hf-seamless-m4t-medium/resolve/main/config.json",
# See all SeamlessM4T models at https://huggingface.co/models?filter=seamless_m4t
}
class SeamlessM4TConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`~SeamlessM4TModel`]. It is used to instantiate an
SeamlessM4T model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the SeamlessM4T
["facebook/hf-seamless-m4t-medium"](https://huggingface.co/"facebook/hf-seamless-m4t-medium") architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 256102):
Vocabulary size of the SeamlessM4T model. Defines the number of different tokens that can be represented by
the `inputs_ids` passed when calling [`~SeamlessM4TModel`], [`~SeamlessM4TForTextToSpeech`] or
[`~SeamlessM4TForTextToText`].
t2u_vocab_size (`int`, *optional*, defaults to 10082):
Unit vocabulary size of the SeamlessM4T model. Defines the number of different unit tokens that can be
represented by the `inputs_ids` passed when calling the Text-To-Units sub-model of [`~SeamlessM4TModel`],
[`~SeamlessM4TForSpeechToSpeech`] or [`~SeamlessM4TForTextToSpeech`].
> Parameters shared across sub-models
hidden_size (`int`, *optional*, defaults to 1024):
Dimensionality of the "intermediate" layers in the architecture.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (`float`, *optional*, defaults to 1e-05):
The epsilon used by the layer normalization layers.
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models).
max_position_embeddings (`int`, *optional*, defaults to 1024):
The maximum sequence length that this model text encoder and decoder might ever be used with. Typically set
this to something large just in case (e.g., 512 or 1024 or 2048).
is_encoder_decoder (`bool`, *optional*, defaults to `True`):
Whether the model is used as an encoder/decoder or not.
encoder_layerdrop (`float`, *optional*, defaults to 0.05):
The LayerDrop probability for the encoders. See the [LayerDrop paper](see https://arxiv.org/abs/1909.11556)
for more details.
decoder_layerdrop (`float`, *optional*, defaults to 0.05):
The LayerDrop probability for the decoders. See the [LayerDrop paper](see https://arxiv.org/abs/1909.11556)
for more details.
activation_function (`str` or `function`, *optional*, defaults to `"relu"`):
The non-linear activation function (function or string) in the decoder and feed-forward layers. If string,
`"gelu"`, `"relu"`, `"selu"`, `"swish"` and `"gelu_new"` are supported.
dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, decoder, and pooler.
attention_dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for all attention layers.
activation_dropout (`float`, *optional*, defaults to 0.0):
The dropout probability for all activation layers in the model.
scale_embedding (`bool`, *optional*, defaults to `True`):
Scale embeddings by diving by sqrt(d_model).
> Text encoder and text decoder specific parameters
encoder_layers (`int`, *optional*, defaults to 24):
Number of hidden layers in the Transformer text encoder.
encoder_ffn_dim (`int`, *optional*, defaults to 8192):
Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer text encoder.
encoder_attention_heads (`int`, *optional*, defaults to 16):
Number of attention heads for each attention layer in the Transformer text encoder.
decoder_layers (`int`, *optional*, defaults to 24):
Number of hidden layers in the Transformer text decoder.
decoder_ffn_dim (`int`, *optional*, defaults to 8192):
Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer text decoder.
decoder_attention_heads (`int`, *optional*, defaults to 16):
Number of attention heads for each attention layer in the Transformer text decoder.
decoder_start_token_id (`int`, *optional*, defaults to 3):
If an encoder-decoder model starts decoding with a different token than _bos_, the id of that token. Only
applied in the text decoder.
max_new_tokens (`int`, *optional*, defaults to 256):
The maximum numbers of text tokens to generate, ignoring the number of tokens in the prompt.
pad_token_id (`int`, *optional*, defaults to 0):
The id of the _padding_ text token. Only applied to the text-decoder model.
bos_token_id (`int`, *optional*, defaults to 2):
The id of the _beginning-of-stream_ text token. Only applied to the text-decoder model.
eos_token_id (`int`, *optional*, defaults to 3):
The id of the _end-of-stream_ text token. Only applied to the text-decoder model.
> Speech encoder specific parameters
speech_encoder_layers (`int`, *optional*, defaults to 24):
Number of hidden layers in the Transformer speech encoder.
speech_encoder_attention_heads (`int`, *optional*, defaults to 16):
Number of attention heads for each attention layer in the Transformer speech encoder.
speech_encoder_intermediate_size (`int`, *optional*, defaults to 4096):
Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer speech encoder.
speech_encoder_hidden_act (`str` or `function`, *optional*, defaults to `"swish"`):
The non-linear activation function (function or string) in the speech encoder. If string, `"gelu"`,
`"relu"`, `"selu"`, `"swish"` and `"gelu_new"` are supported.
speech_encoder_dropout (`float`, *optional*, defaults to 0.0):
The dropout probability for all layers in the speech encoder.
add_adapter (`bool`, *optional*, defaults to `True`):
Add an adapter layer on top of the speech encoder.
speech_encoder_layerdrop (`float`, *optional*, defaults to 0.1):
The LayerDrop probability for the speech encoder. See the [LayerDrop paper](see
https://arxiv.org/abs/1909.11556) for more details.
feature_projection_input_dim (`int`, *optional*, defaults to 160):
Input dimension of the input feature projection of the speech encoder, i.e the dimension after processing
input audios with [`SeamlessM4TFeatureExtractor`].
num_conv_pos_embeddings (`int`, *optional*, defaults to 128):
Number of convolutional positional embeddings. Defines the kernel size of 1D convolutional positional
embeddings layer of the speech encoder.
num_conv_pos_embedding_groups (`int`, *optional*, defaults to 16):
Number of groups of 1D convolutional positional embeddings layer of the speech encoder.
adaptor_kernel_size (`int`, *optional*, defaults to 8):
Kernel size of the convolutional layers in the adapter network. Only relevant if `add_adapter is True`.
adaptor_stride (`int`, *optional*, defaults to 8):
Stride of the convolutional layers in the adapter network. Only relevant if `add_adapter is True`.
adaptor_dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for all layers in the speech adapter.
num_adapter_layers (`int`, *optional*, defaults to 1):
Number of convolutional layers that should be used in the adapter network. Only relevant if `add_adapter is
True`.
position_embeddings_type (`str`, *optional*, defaults to `"relative"`):
Can be specified to `relative` or `rotary` for relative or rotary position embeddings respectively. If left
`None` no relative position embedding is applied. Only applied to the speech encoder.
rotary_embedding_base (`int`, *optional*, defaults to 10000):
If `"rotary"` position embeddings are used, defines the size of the embedding base. Only applied to the
speech encoder.
max_source_positions (`int`, *optional*, defaults to 4096):
if `"relative"` position embeddings are used, defines the maximum source input positions. Only applied to
the speech encoder.
conv_depthwise_kernel_size (`int`, *optional*, defaults to 31):
Kernel size of convolutional depthwise 1D layer in Conformer blocks. Only applied to the speech encoder.
> Text-To-Unit (t2u) model specific parameters
t2u_bos_token_id (`int`, *optional*, defaults to 0):
The id of the _beginning-of-stream_ unit token. Only applied to the text-to-unit seq2seq model.
t2u_pad_token_id (`int`, *optional*, defaults to 1):
The id of the _padding_ unit token. Only applied to the text-to-unit seq2seq model.
t2u_eos_token_id (`int`, *optional*, defaults to 2):
The id of the _end-of-stream_ unit token. Only applied to the text-to-unit seq2seq model.
t2u_decoder_start_token_id (`int`, *optional*, defaults to 2):
If an encoder-decoder model starts decoding with a different token than _bos_, the id of that token. Only
applied to the text-to-unit seq2seq model.
t2u_max_new_tokens (`int`, *optional*, defaults to 1024):
The maximum numbers of unit tokens to generate, ignoring the number of tokens in the prompt. Only applied
to the text-to-unit seq2seq model.
t2u_encoder_layers (`int`, *optional*, defaults to 6):
Number of hidden layers in the Transformer text-to-unit encoder.
t2u_encoder_ffn_dim (`int`, *optional*, defaults to 8192):
Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer text-to-unit encoder.
t2u_encoder_attention_heads (`int`, *optional*, defaults to 16):
Number of attention heads for each attention layer in the Transformer text-to-unit encoder.
t2u_decoder_layers (`int`, *optional*, defaults to 6):
Number of hidden layers in the Transformer text-to-unit decoder.
t2u_decoder_ffn_dim (`int`, *optional*, defaults to 8192):
Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer text-to-unit decoder.
t2u_decoder_attention_heads (`int`, *optional*, defaults to 16):
Number of attention heads for each attention layer in the Transformer text-to-unit decoder.
t2u_max_position_embeddings (`int`, *optional*, defaults to 2048):
The maximum sequence length that this model text-to-unit component might ever be used with. Typically set
this to something large just in case (e.g., 512 or 1024 or 2048).
> Hifi-Gan Vocoder specific parameters
sampling_rate (`int`, *optional*, defaults to 16000):
The sampling rate at which the output audio will be generated, expressed in hertz (Hz).
upsample_initial_channel (`int`, *optional*, defaults to 512):
The number of input channels into the hifi-gan upsampling network. Applies to the vocoder only.
upsample_rates (`Tuple[int]` or `List[int]`, *optional*, defaults to `[5, 4, 4, 2, 2]`):
A tuple of integers defining the stride of each 1D convolutional layer in the vocoder upsampling network.
The length of *upsample_rates* defines the number of convolutional layers and has to match the length of
*upsample_kernel_sizes*. Applies to the vocoder only.
upsample_kernel_sizes (`Tuple[int]` or `List[int]`, *optional*, defaults to `[11, 8, 8, 4, 4]`):
A tuple of integers defining the kernel size of each 1D convolutional layer in the vocoder upsampling
network. The length of *upsample_kernel_sizes* defines the number of convolutional layers and has to match
the length of *upsample_rates*. Applies to the vocoder only.
resblock_kernel_sizes (`Tuple[int]` or `List[int]`, *optional*, defaults to `[3, 7, 11]`):
A tuple of integers defining the kernel sizes of the vocoder 1D convolutional layers in the multi-receptive
field fusion (MRF) module. Applies to the vocoder only.
resblock_dilation_sizes (`Tuple[Tuple[int]]` or `List[List[int]]`, *optional*, defaults to `[[1, 3, 5], [1, 3, 5], [1, 3, 5]]`):
A nested tuple of integers defining the dilation rates of the vocoder dilated 1D convolutional layers in
the multi-receptive field fusion (MRF) module. Applies to the vocoder only.
leaky_relu_slope (`float`, *optional*, defaults to 0.1):
The angle of the negative slope used by the leaky ReLU activation in the vocoder. Applies to the vocoder
only.
unit_hifi_gan_vocab_size (`int`, *optional*, defaults to 10000):
Vocabulary size of the SeamlessM4T vocoder. Defines the number of different unit tokens that can be
represented by the `inputs_ids` passed when calling the vocoder of [`~SeamlessM4TModel`],
[`~SeamlessM4TForSpeechToSpeech`] or [`~SeamlessM4TForTextToSpeech`].
unit_embed_dim (`int`, *optional*, defaults to 1280):
The projection dimension of the input ids given to the hifi-gan vocoder. Applies to the vocoder only.
lang_embed_dim (`int`, *optional*, defaults to 256):
The projection dimension of the target language given to the hifi-gan vocoder. Applies to the vocoder only.
spkr_embed_dim (`int`, *optional*, defaults to 256):
The projection dimension of the speaker id given to the hifi-gan vocoder. Applies to the vocoder only.
vocoder_num_langs (`int`, *optional*, defaults to 36):
Number of langs supported by the vocoder. Might be different from `t2u_num_langs`.
vocoder_num_spkrs (`int`, *optional*, defaults to 200):
Number of speakers supported by the vocoder.
variance_predictor_kernel_size (`int`, *optional*, defaults to 3):
Kernel size of the duration predictor. Applies to the vocoder only.
var_pred_dropout (`float`, *optional*, defaults to 0.5):
The dropout probabilitiy of the duration predictor. Applies to the vocoder only.
vocoder_offset (`int`, *optional*, defaults to 4):
Offset the unit token ids by this number to account for symbol tokens. Applies to the vocoder only.
```python
>>> from transformers import SeamlessM4TModel, SeamlessM4TConfig
>>> # Initializing a SeamlessM4T "facebook/hf-seamless-m4t-medium" style configuration
>>> configuration = SeamlessM4TConfig()
>>> # Initializing a model from the "facebook/hf-seamless-m4t-medium" style configuration
>>> model = SeamlessM4TModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "seamless_m4t"
def __init__(
self,
vocab_size=256102,
t2u_vocab_size=10082,
# shared config
hidden_size=1024,
initializer_range=0.02,
layer_norm_eps=1e-5,
use_cache=True,
max_position_embeddings=1024,
is_encoder_decoder=True,
encoder_layerdrop=0.05,
decoder_layerdrop=0.05,
activation_function="relu",
dropout=0.1,
attention_dropout=0.1,
activation_dropout=0.0,
scale_embedding=True,
# text encoder|decoder
encoder_layers=24,
encoder_ffn_dim=8192,
encoder_attention_heads=16,
decoder_layers=24,
decoder_ffn_dim=8192,
decoder_attention_heads=16,
decoder_start_token_id=3,
max_new_tokens=256,
pad_token_id=0,
bos_token_id=2,
eos_token_id=3,
# speech_encoder
speech_encoder_layers=24,
speech_encoder_attention_heads=16,
speech_encoder_intermediate_size=4096,
speech_encoder_hidden_act="swish",
speech_encoder_dropout=0.0,
add_adapter=True,
speech_encoder_layerdrop=0.1,
feature_projection_input_dim=160,
num_conv_pos_embeddings=128,
num_conv_pos_embedding_groups=16,
adaptor_kernel_size=8,
adaptor_stride=8,
adaptor_dropout=0.1,
num_adapter_layers=1,
position_embeddings_type="relative",
rotary_embedding_base=10000,
max_source_positions=4096,
conv_depthwise_kernel_size=31,
# t2u config
t2u_bos_token_id=0,
t2u_pad_token_id=1,
t2u_eos_token_id=2,
t2u_decoder_start_token_id=2,
t2u_max_new_tokens=1024,
t2u_encoder_layers=6,
t2u_encoder_ffn_dim=8192,
t2u_encoder_attention_heads=16,
t2u_decoder_layers=6,
t2u_decoder_ffn_dim=8192,
t2u_decoder_attention_heads=16,
t2u_max_position_embeddings=2048,
# hifi-gan vocoder config
sampling_rate=16000,
upsample_initial_channel=512,
upsample_rates=[5, 4, 4, 2, 2],
upsample_kernel_sizes=[11, 8, 8, 4, 4],
resblock_kernel_sizes=[3, 7, 11],
resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
leaky_relu_slope=0.1,
# specific to Code Hifi-Gan
unit_hifi_gan_vocab_size=10000,
unit_embed_dim=1280,
lang_embed_dim=256,
spkr_embed_dim=256,
vocoder_num_langs=36,
vocoder_num_spkrs=200,
variance_predictor_kernel_size=3,
var_pred_dropout=0.5,
vocoder_offset=4,
**kwargs,
):
# overall_config
self.vocab_size = vocab_size
self.t2u_vocab_size = t2u_vocab_size
self.hidden_size = hidden_size
self.initializer_range = initializer_range
self.layer_norm_eps = layer_norm_eps
self.max_position_embeddings = max_position_embeddings
self.use_cache = use_cache
self.max_new_tokens = max_new_tokens
self.encoder_layerdrop = encoder_layerdrop
self.decoder_layerdrop = decoder_layerdrop
self.activation_function = activation_function
self.dropout = dropout
self.attention_dropout = attention_dropout
self.activation_dropout = activation_dropout
self.scale_embedding = scale_embedding
# for proper config init
self.num_attention_heads = decoder_attention_heads
self.num_hidden_layers = decoder_layers
# text|unit encoder|decoder
self.encoder_layers = encoder_layers
self.encoder_ffn_dim = encoder_ffn_dim
self.encoder_attention_heads = encoder_attention_heads
self.decoder_layers = decoder_layers
self.decoder_ffn_dim = decoder_ffn_dim
self.decoder_attention_heads = decoder_attention_heads
# speech_encoder
self.speech_encoder_layers = speech_encoder_layers
self.speech_encoder_hidden_act = speech_encoder_hidden_act
self.speech_encoder_dropout = speech_encoder_dropout
self.speech_encoder_attention_heads = speech_encoder_attention_heads
self.speech_encoder_layerdrop = speech_encoder_layerdrop
self.speech_encoder_intermediate_size = speech_encoder_intermediate_size
self.feature_projection_input_dim = feature_projection_input_dim
self.num_conv_pos_embeddings = num_conv_pos_embeddings
self.num_conv_pos_embedding_groups = num_conv_pos_embedding_groups
self.adaptor_kernel_size = adaptor_kernel_size
self.adaptor_stride = adaptor_stride
self.adaptor_dropout = adaptor_dropout
self.num_adapter_layers = num_adapter_layers
self.position_embeddings_type = position_embeddings_type
self.rotary_embedding_base = rotary_embedding_base
self.max_source_positions = max_source_positions
self.conv_depthwise_kernel_size = conv_depthwise_kernel_size
self.add_adapter = add_adapter
# t2u config
self.t2u_bos_token_id = t2u_bos_token_id
self.t2u_pad_token_id = t2u_pad_token_id
self.t2u_eos_token_id = t2u_eos_token_id
self.t2u_decoder_start_token_id = t2u_decoder_start_token_id
self.t2u_max_new_tokens = t2u_max_new_tokens
self.t2u_encoder_layers = t2u_encoder_layers
self.t2u_encoder_ffn_dim = t2u_encoder_ffn_dim
self.t2u_encoder_attention_heads = t2u_encoder_attention_heads
self.t2u_decoder_layers = t2u_decoder_layers
self.t2u_decoder_ffn_dim = t2u_decoder_ffn_dim
self.t2u_decoder_attention_heads = t2u_decoder_attention_heads
self.t2u_max_position_embeddings = t2u_max_position_embeddings
# hifi-gan vocoder config
# original parameters specific to Hifi-Gan
self.sampling_rate = sampling_rate
self.upsample_initial_channel = upsample_initial_channel
self.upsample_rates = upsample_rates
self.upsample_kernel_sizes = upsample_kernel_sizes
self.resblock_kernel_sizes = resblock_kernel_sizes
self.resblock_dilation_sizes = resblock_dilation_sizes
self.leaky_relu_slope = leaky_relu_slope
# specific to Code Hifi-Gan
self.unit_hifi_gan_vocab_size = unit_hifi_gan_vocab_size
self.unit_embed_dim = unit_embed_dim
self.lang_embed_dim = lang_embed_dim
self.spkr_embed_dim = spkr_embed_dim
self.vocoder_num_langs = vocoder_num_langs
self.vocoder_num_spkrs = vocoder_num_spkrs
self.variance_predictor_kernel_size = variance_predictor_kernel_size
self.var_pred_dropout = var_pred_dropout
self.vocoder_offset = vocoder_offset
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
decoder_start_token_id=decoder_start_token_id,
is_encoder_decoder=is_encoder_decoder,
max_position_embeddings=max_position_embeddings,
**kwargs,
)
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Converting Meta SeamlessM4T checkpoints from seamless_communication to HF."""
import argparse
import os
from pathlib import Path
import torch
from accelerate.utils.modeling import find_tied_parameters
from seamless_communication.models.inference.translator import Translator
from transformers import (
SeamlessM4TConfig,
SeamlessM4TFeatureExtractor,
SeamlessM4TModel,
SeamlessM4TProcessor,
SeamlessM4TTokenizer,
)
from transformers.utils import logging
# fmt: off
UNIT_SUPPORTED_LANGUAGES = ["__arb__", "__ben__", "__cat__", "__ces__", "__cmn__", "__cym__", "__dan__", "__deu__", "__eng__", "__est__", "__fin__", "__fra__", "__hin__", "__ind__", "__ita__", "__jpn__", "__kan__", "__kor__", "__mlt__", "__nld__", "__pes__", "__pol__", "__por__", "__ron__", "__rus__", "__slk__", "__spa__", "__swe__", "__swh__", "__tam__", "__tel__", "__tgl__", "__tha__", "__tur__", "__ukr__", "__urd__", "__uzn__", "__vie__", ]
# fmt: on
# fmt: off
VOCODER_SUPPORTED_LANGUAGES = ["__arb__", "__ben__", "__cat__", "__ces__", "__cmn__", "__cym__", "__dan__", "__deu__", "__eng__", "__est__", "__fin__", "__fra__", "__hin__", "__ind__", "__ita__", "__jpn__", "__kor__", "__mlt__", "__nld__", "__pes__", "__pol__", "__por__", "__ron__", "__rus__", "__slk__", "__spa__", "__swe__", "__swh__", "__tel__", "__tgl__", "__tha__", "__tur__", "__ukr__", "__urd__", "__uzn__", "__vie__",]
# fmt: on
# fmt: off
MEDIUM_SUPPORTED_LANGUAGES = ["ace","ace_Latn","acm","acq","aeb","afr","ajp","aka","amh","apc","arb","ars","ary","arz","asm","ast","awa","ayr","azb","azj","bak","bam","ban","bel","bem","ben","bho","bjn","bjn_Latn","bod","bos","bug","bul","cat","ceb","ces","cjk","ckb","crh","cym","dan","deu","dik","dyu","dzo","ell","eng","epo","est","eus","ewe","fao","pes","fij","fin","fon","fra","fur","fuv","gla","gle","glg","grn","guj","hat","hau","heb","hin","hne","hrv","hun","hye","ibo","ilo","ind","isl","ita","jav","jpn","kab","kac","kam","kan","kas","kas_Deva","kat","knc","knc_Latn","kaz","kbp","kea","khm","kik","kin","kir","kmb","kon","kor","kmr","lao","lvs","lij","lim","lin","lit","lmo","ltg","ltz","lua","lug","luo","lus","mag","mai","mal","mar","min","mkd","plt","mlt","mni","khk","mos","mri","zsm","mya","nld","nno","nob","npi","nso","nus","nya","oci","gaz","ory","pag","pan","pap","pol","por","prs","pbt","quy","ron","run","rus","sag","san","sat","scn","shn","sin","slk","slv","smo","sna","snd","som","sot","spa","als","srd","srp","ssw","sun","swe","swh","szl","tam","tat","tel","tgk","tgl","tha","tir","taq","taq_Tfng","tpi","tsn","tso","tuk","tum","tur","twi","tzm","uig","ukr","umb","urd","uzn","vec","vie","war","wol","xho","ydd","yor","yue","cmn","cmn_Hant","zul",]
# fmt: on
# fmt: off
LARGE_SUPPORTED_LANGUAGES = ["afr","amh","arb","ary","arz","asm","azj","bel","ben","bos","bul","cat","ceb","ces","ckb","cmn","cmn_Hant","cym","dan","deu","ell","eng","est","eus","fin","fra","fuv","gaz","gle","glg","guj","heb","hin","hrv","hun","hye","ibo","ind","isl","ita","jav","jpn","kan","kat","kaz","khk","khm","kir","kor","lao","lit","lug","luo","lvs","mai","mal","mar","mkd","mlt","mni","mya","nld","nno","nob","npi","nya","ory","pan","pbt","pes","pol","por","ron","rus","sat","slk","slv","sna","snd","som","spa","srp","swe","swh","tam","tel","tgk","tgl","tha","tur","ukr","urd","uzn","vie","yor","yue","zlm","zul",]
# fmt: on
def assert_param_count(model_1, model_2):
count_1 = sum(p[1].numel() for p in model_1.named_parameters() if "final_proj" not in p[0])
count_2 = sum(p[1].numel() for p in model_2.named_parameters() if "final_proj" not in p[0])
assert count_1 == count_2, f"{model_1.__class__}: {count_1} != {model_2.__class__}: {count_2}"
def param_count(model):
return sum(p[1].numel() for p in model.named_parameters() if "final_proj" not in p[0])
def _grab_best_device(use_gpu=True):
if torch.cuda.device_count() > 0 and use_gpu:
device = "cuda"
else:
device = "cpu"
return torch.device(device)
logging.set_verbosity_info()
logger = logging.get_logger(__name__)
vocoder_convert_list = [
("ups", "hifi_gan.upsampler"),
("conv_pre", "hifi_gan.conv_pre"),
("resblocks", "hifi_gan.resblocks"),
("conv_post", "hifi_gan.conv_post"),
("lang", "language_embedding"),
("spkr", "speaker_embedding"),
("dict.", "unit_embedding."),
("dur_predictor.conv1.0", "dur_predictor.conv1"),
("dur_predictor.conv2.0", "dur_predictor.conv2"),
]
# order is important
wav2vec_convert_list = [
("speech_encoder_frontend.model_dim_proj", "feature_projection.projection"),
("speech_encoder_frontend.post_extract_layer_norm", "feature_projection.layer_norm"),
("speech_encoder_frontend.pos_encoder.conv", "encoder.pos_conv_embed.conv"),
("speech_encoder.inner.layers", "encoder.layers"),
("speech_encoder.inner_layer_norm", "encoder.layer_norm"),
("speech_encoder.adaptor_layers", "adapter.layers"),
("inner_proj", "intermediate_dense"),
("self_attn.output_proj", "self_attn.linear_out"),
("output_proj", "output_dense"),
("self_attn.k_proj", "self_attn.linear_k"),
("self_attn.v_proj", "self_attn.linear_v"),
("self_attn.q_proj", "self_attn.linear_q"),
("self_attn.sdpa.u_bias", "self_attn.pos_bias_u"),
("self_attn.sdpa.v_bias", "self_attn.pos_bias_v"),
("self_attn.sdpa.r_proj", "self_attn.linear_pos"),
("conv.pointwise_conv1", "conv_module.pointwise_conv1"),
("conv.pointwise_conv2", "conv_module.pointwise_conv2"),
("conv.depthwise_conv", "conv_module.depthwise_conv"),
("conv.batch_norm", "conv_module.batch_norm"),
("conv_layer_norm", "conv_module.layer_norm"),
("speech_encoder.proj1", "intermediate_ffn.intermediate_dense"),
("speech_encoder.proj2", "intermediate_ffn.output_dense"),
("speech_encoder.layer_norm", "inner_layer_norm"),
]
t2u_convert_list = [
("t2u_model.final_proj", "lm_head"),
("t2u_model.", "model."),
("encoder_decoder_attn_layer_norm", "cross_attention_layer_norm"),
("encoder_decoder_attn", "cross_attention"),
("linear_k", "k_proj"),
("linear_v", "v_proj"),
("linear_q", "q_proj"),
("ffn.inner_proj", "ffn.fc1"),
("ffn.output_proj", "ffn.fc2"),
("output_proj", "out_proj"),
("decoder_frontend.embed", "decoder.embed_tokens"),
]
text_convert_list = [
("text_encoder.", ""),
("text_decoder.", ""),
("text_encoder_frontend.embed", "embed_tokens"),
("text_decoder_frontend.embed", "embed_tokens"),
("encoder_decoder_attn_layer_norm", "cross_attention_layer_norm"),
("encoder_decoder_attn", "cross_attention"),
("linear_k", "k_proj"),
("linear_v", "v_proj"),
("linear_q", "q_proj"),
("ffn.inner_proj", "ffn.fc1"),
("ffn.output_proj", "ffn.fc2"),
("output_proj", "out_proj"),
("final_proj", "lm_head"),
]
CUR_PATH = os.path.dirname(os.path.abspath(__file__))
default_cache_dir = os.path.join(os.path.expanduser("~"), ".cache")
CACHE_DIR = os.path.join(os.getenv("XDG_CACHE_HOME", default_cache_dir), "huggingface", "hub")
def _load_hf_config(model_type="medium"):
if model_type == "medium":
kwargs = {
"vocab_size": 256206,
"t2u_vocab_size": 10082,
"hidden_size": 1024,
"max_position_embeddings": 4096,
"encoder_layers": 12,
"decoder_layers": 12,
"encoder_ffn_dim": 4096,
"decoder_ffn_dim": 4096,
"t2u_encoder_layers": 4,
"t2u_decoder_layers": 4,
"speech_encoder_layers": 12,
}
return SeamlessM4TConfig(**kwargs)
else:
return SeamlessM4TConfig()
def _convert_model(
original_model,
hf_model,
convert_list,
device,
unwanted_prefix="model.",
filter_state_dict="speech",
exclude_state_dict=None,
):
state_dict = original_model.state_dict()
# filter func
if isinstance(filter_state_dict, str):
def filter_func(x):
return filter_state_dict in x[0]
else:
def filter_func(item):
if exclude_state_dict is not None and exclude_state_dict in item[0]:
return False
for filter_el in filter_state_dict:
if filter_el in item[0]:
return True
return False
state_dict = dict(filter(filter_func, state_dict.items()))
for k, v in list(state_dict.items()):
new_k = k[len(unwanted_prefix) :]
for old_layer_name, new_layer_name in convert_list:
if old_layer_name in new_k:
new_k = new_k.replace(old_layer_name, new_layer_name)
# must do it by hand
if ".layer_norm" in new_k and new_k.split(".layer_norm")[0][-1].isnumeric():
new_k = new_k.replace("layer_norm", "final_layer_norm")
state_dict[new_k] = state_dict.pop(k)
extra_keys = set(state_dict.keys()) - set(hf_model.state_dict().keys())
extra_keys = set(extra_keys)
missing_keys = set(hf_model.state_dict().keys()) - set(state_dict.keys())
missing_keys = set({k for k in missing_keys if "final_logits_bias" not in k})
if len(extra_keys) != 0:
raise ValueError(f"extra keys found: {extra_keys}")
if len(missing_keys) != 0:
raise ValueError(f"missing keys: {missing_keys}")
hf_model.load_state_dict(state_dict, strict=False)
n_params = param_count(hf_model)
logger.info(f"model loaded: {round(n_params/1e6,1)}M params")
hf_model.eval()
hf_model.to(device)
del state_dict
return hf_model
def load_model(save_dir, model_type, repo_id):
"""
Meta SeamlessM4T is made of 8 main components:
- speech_encoder (#1) and speech_encoder_frontend (#2)
- t2u_model (#3)
- text_encoder (#4) and text_encoder_frontend (#5)
- text_decoder (#6) [and text_decoder_frontend (#5) = equals to text_encoder_frontend]
- final_proj (#7)
- vocoder (#8)
"""
device = _grab_best_device()
if model_type == "medium":
name = "seamlessM4T_medium"
else:
name = "seamlessM4T_large"
original_model = Translator(name, "vocoder_36langs", device, torch.float32)
######### TOKENIZER
langs = MEDIUM_SUPPORTED_LANGUAGES if model_type == "medium" else LARGE_SUPPORTED_LANGUAGES
langs = [f"__{lang}__" for lang in langs]
vocab_file = os.path.join(os.path.expanduser("~"), "tokenizer", model_type, "tokenizer.model")
save_dir = os.path.join(save_dir, name)
Path(save_dir).mkdir(exist_ok=True)
tokenizer = SeamlessM4TTokenizer(vocab_file, additional_special_tokens=langs)
sanity_check_lang_id = tokenizer.convert_tokens_to_ids("__fra__")
tokenizer.save_pretrained(save_dir)
tokenizer = SeamlessM4TTokenizer.from_pretrained(save_dir)
if sanity_check_lang_id != tokenizer.convert_tokens_to_ids("__fra__"):
raise ValueError(
f"Error in tokenizer saving/loading - __fra__ lang id is not coherent: {sanity_check_lang_id} vs {tokenizer.convert_tokens_to_ids('__fra__')}"
)
####### get language to ids dict
text_decoder_lang_code_to_id = {lang.replace("__", ""): tokenizer.convert_tokens_to_ids(lang) for lang in langs}
# offset: vocoder unit vocab size + 5 (for EOS/PAD/BOS/UNK/MSK) + len(supported_languages)
t2u_lang_code_to_id = {
code.replace("__", ""): i + 10005 + len(UNIT_SUPPORTED_LANGUAGES)
for i, code in enumerate(UNIT_SUPPORTED_LANGUAGES)
}
vocoder_lang_code_to_id = {code.replace("__", ""): i for i, code in enumerate(VOCODER_SUPPORTED_LANGUAGES)}
######### FE
fe = SeamlessM4TFeatureExtractor(language_code=langs)
fe.save_pretrained(save_dir)
fe = SeamlessM4TFeatureExtractor.from_pretrained(save_dir)
processor = SeamlessM4TProcessor(feature_extractor=fe, tokenizer=tokenizer)
processor.save_pretrained(save_dir)
processor.push_to_hub(repo_id=repo_id, create_pr=True)
processor = SeamlessM4TProcessor.from_pretrained(save_dir)
######## Model
# init model
hf_config = _load_hf_config(model_type)
hf_model = SeamlessM4TModel(hf_config)
hf_model.generation_config.__setattr__("text_decoder_lang_to_code_id", text_decoder_lang_code_to_id)
hf_model.generation_config.__setattr__("t2u_lang_code_to_id", t2u_lang_code_to_id)
hf_model.generation_config.__setattr__("vocoder_lang_code_to_id", vocoder_lang_code_to_id)
# -1. take care of vocoder
# similarly to speech T5 must apply and remove weight norm
hf_model.vocoder.apply_weight_norm()
hf_model.vocoder = _convert_model(
original_model,
hf_model.vocoder,
vocoder_convert_list,
device,
unwanted_prefix="vocoder.code_generator.",
filter_state_dict="vocoder",
)
hf_model.vocoder.remove_weight_norm()
# 1. take care of speech encoder
wav2vec = hf_model.speech_encoder
hf_model.speech_encoder = _convert_model(
original_model, wav2vec, wav2vec_convert_list, device, unwanted_prefix="model.", filter_state_dict="speech"
)
# 2. take care of t2u
hf_model.t2u_model = _convert_model(
original_model,
hf_model.t2u_model,
t2u_convert_list,
device,
unwanted_prefix="model.",
filter_state_dict="t2u_model",
)
# 3. take care of text encoder
hf_model.text_encoder = _convert_model(
original_model,
hf_model.text_encoder,
text_convert_list,
device,
unwanted_prefix="model.",
filter_state_dict=["model.text_encoder"],
exclude_state_dict="t2u_model",
)
# 4. take care of text decoder
hf_model.text_decoder = _convert_model(
original_model,
hf_model.text_decoder,
text_convert_list,
device,
unwanted_prefix="model.",
filter_state_dict=["model.text_decoder"],
exclude_state_dict="t2u_model",
)
# 5. take care of final proj
hf_model.lm_head = _convert_model(
original_model,
hf_model.lm_head,
[("final_proj.", "")],
device,
unwanted_prefix="model.",
filter_state_dict=["model.final_proj"],
exclude_state_dict="t2u_model",
)
# sanity check
print(find_tied_parameters(hf_model))
count_1 = param_count(hf_model)
count_2 = param_count(original_model)
print(f"HF MODEL:{count_1}, ORIGINAL_MODEL: {count_2}, diff:{count_1 - count_2}")
print(f"HF MODEL excluding embeddings:{hf_model.num_parameters(exclude_embeddings=True)}")
del original_model
hf_model.generation_config._from_model_config = False
hf_model.save_pretrained(save_dir)
hf_model.push_to_hub(repo_id=repo_id, create_pr=True)
hf_model = SeamlessM4TModel.from_pretrained(save_dir)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--model_type",
default="medium",
type=str,
help="Model type.",
)
parser.add_argument(
"--save_dir",
default="/home/ubuntu/weights",
type=str,
help="Path to the output PyTorch model.",
)
parser.add_argument(
"--repo_id",
default="facebook/hf-seamless-m4t-medium",
type=str,
help="Repo ID.",
)
args = parser.parse_args()
load_model(args.save_dir, args.model_type, args.repo_id)
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Feature extractor class for SeamlessM4T
"""
import copy
from typing import List, Optional, Union
import numpy as np
from ...audio_utils import mel_filter_bank, spectrogram, window_function
from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
from ...feature_extraction_utils import BatchFeature
from ...utils import PaddingStrategy, TensorType, logging
logger = logging.get_logger(__name__)
class SeamlessM4TFeatureExtractor(SequenceFeatureExtractor):
r"""
Constructs a SeamlessM4T feature extractor.
This feature extractor inherits from [`SequenceFeatureExtractor`] which contains most of the main methods. Users
should refer to this superclass for more information regarding those methods.
This class extracts mel-filter bank features from raw speech.
Args:
feature_size (`int`, *optional*, defaults to 80):
The feature dimension of the extracted features.
sampling_rate (`int`, *optional*, defaults to 16000):
The sampling rate at which the audio files should be digitalized expressed in hertz (Hz).
num_mel_bins (`int`, *optional*, defaults to 80):
Number of Mel-frequency bins.
padding_value (`float`, *optional*, defaults to 0.0):
The value that is used to fill the padding vectors.
stride (`int`, *optional*, defaults to 2):
Stride used to reshape audios from shape (batch_size,num_frames,num_mel_bins) to
(batch_size,num_frames//stride,num_mel_bins*stride).
"""
model_input_names = ["input_features", "attention_mask"]
def __init__(
self,
feature_size=80,
sampling_rate=16000,
num_mel_bins=80,
padding_value=0.0,
stride=2,
**kwargs,
):
self.num_mel_bins = num_mel_bins
self.return_attention_mask = True
self.stride = stride
mel_filters = mel_filter_bank(
num_frequency_bins=256,
num_mel_filters=self.num_mel_bins,
min_frequency=20,
max_frequency=sampling_rate // 2,
sampling_rate=sampling_rate,
norm=None,
mel_scale="kaldi",
triangularize_in_mel_space=True,
)
self.mel_filters = np.pad(mel_filters, ((0, 1), (0, 0)))
self.window = window_function(400, "povey", periodic=False)
super().__init__(feature_size=feature_size, sampling_rate=sampling_rate, padding_value=padding_value, **kwargs)
@staticmethod
# Copied from transformers.models.wav2vec2.feature_extraction_wav2vec2.Wav2Vec2FeatureExtractor.zero_mean_unit_var_norm
def zero_mean_unit_var_norm(
input_values: List[np.ndarray], attention_mask: List[np.ndarray], padding_value: float = 0.0
) -> List[np.ndarray]:
"""
Every array in the list is normalized to have zero mean and unit variance
"""
if attention_mask is not None:
attention_mask = np.array(attention_mask, np.int32)
normed_input_values = []
for vector, length in zip(input_values, attention_mask.sum(-1)):
normed_slice = (vector - vector[:length].mean()) / np.sqrt(vector[:length].var() + 1e-7)
if length < normed_slice.shape[0]:
normed_slice[length:] = padding_value
normed_input_values.append(normed_slice)
else:
normed_input_values = [(x - x.mean()) / np.sqrt(x.var() + 1e-7) for x in input_values]
return normed_input_values
def _extract_fbank_features(
self,
waveform: np.ndarray,
) -> np.ndarray:
"""
Get mel-filter bank features using TorchAudio. Note that TorchAudio requires 16-bit signed integers as inputs
and hence the waveform should not be normalized before feature extraction.
"""
# by default, it extracts the left channel if stereo
if len(waveform.shape) == 2:
waveform = waveform[0]
waveform = np.squeeze(waveform) * (2**15) # Kaldi compliance: 16-bit signed integers
features = spectrogram(
waveform,
self.window,
frame_length=400,
hop_length=160,
fft_length=512,
power=2.0,
center=False,
preemphasis=0.97,
mel_filters=self.mel_filters,
log_mel="log",
mel_floor=1.192092955078125e-07,
remove_dc_offset=True,
).T
return features
def __call__(
self,
raw_speech: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]]],
padding: Union[bool, str, PaddingStrategy] = True,
pad_to_multiple_of: Optional[int] = 2,
max_length: Optional[int] = None,
truncation: bool = False,
return_tensors: Optional[Union[str, TensorType]] = None,
sampling_rate: Optional[int] = None,
return_attention_mask: Optional[bool] = None,
do_normalize_per_mel_bins: Optional[bool] = True,
**kwargs,
) -> BatchFeature:
"""
Main method to featurize and prepare for the model one or several sequence(s).
Args:
raw_speech (`np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[List[float]]`, `List[List[List[float]]]`):
The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float
values, a list of numpy arrays, a list of list of float values or a list of a list of list of float
values. If `raw_speech` is a one-dimensional `np.ndarray` or a `List[float]`, `raw_speech` is
considered a single-channel, single-sample sound. In all other cases, the first dimension of
`raw_speech`, whether from an `np.ndarray` or a `List[...]`, corresponds to the number of samples in
the batch, and the number of channels (i.e. mono or stereo character) is derived from the other
dimensions (1D -> single-channel waveform batches; 2D-> stereo-channel waveform batches).
padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
Select a strategy to pad the returned sequences (according to the model's padding side and padding
index) among:
- `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
sequence if provided).
- `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
acceptable input length for the model if that argument is not provided.
- `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
lengths).
pad_to_multiple_of (`int`, *optional*, defaults to 2):
If set will pad the sequence to a multiple of the provided value.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
`>= 7.5` (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
max_length (`int`, *optional*):
Maximum length of the returned list and optionally padding length (see above).
truncation (`bool`):
Activates truncation to cut input sequences longer than *max_length* to *max_length*.
return_attention_mask (`bool`, *optional*):
Whether to return the attention mask. If left to the default, will return the attention mask according
to the specific feature_extractor's default.
[What are attention masks?](../glossary#attention-mask)
<Tip>
For SeamlessM4T models, `attention_mask` should always be passed for batched inference, to avoid subtle
bugs.
</Tip>
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors instead of list of python integers. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return Numpy `np.ndarray` objects.
sampling_rate (`int`, *optional*):
The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
`sampling_rate` at the forward call to prevent silent errors.
do_normalize_per_mel_bins (`bool`, *optional*, defaults to `True`):
Whether or not to zero-mean unit-variance normalize the input per mel-channel.
kwargs (*optional*):
Remaining dictionary of keyword arguments that will be passed to the tokenizer or the feature
extractor.
"""
if sampling_rate is not None:
if sampling_rate != self.sampling_rate:
raise ValueError(
f"The model corresponding to this feature extractor: {self} was trained using a sampling rate of"
f" {self.sampling_rate}. Please make sure that the provided `raw_speech` input was sampled with"
f" {self.sampling_rate} and not {sampling_rate}."
)
else:
logger.warning(
"It is strongly recommended to pass the `sampling_rate` argument to this function. "
"Failing to do so can result in silent errors that might be hard to debug."
)
is_batched_numpy = isinstance(raw_speech, np.ndarray) and len(raw_speech.shape) > 1
if is_batched_numpy and len(raw_speech.shape) > 3:
raise ValueError(f"Only mono-channel or stereo-channel audio is supported for input to {self}")
is_batched = is_batched_numpy or (
isinstance(raw_speech, (list, tuple)) and (isinstance(raw_speech[0], (np.ndarray, tuple, list)))
)
if is_batched:
raw_speech = [np.asarray(speech, dtype=np.float32) for speech in raw_speech]
elif not is_batched and not isinstance(raw_speech, np.ndarray):
raw_speech = np.asarray(raw_speech, dtype=np.float32)
elif isinstance(raw_speech, np.ndarray) and raw_speech.dtype is np.dtype(np.float64):
raw_speech = raw_speech.astype(np.float32)
# always return batch
if not is_batched:
raw_speech = [raw_speech]
# extract fbank features
features = [self._extract_fbank_features(waveform) for waveform in raw_speech]
if do_normalize_per_mel_bins:
# torch defaults to ddof=1, and numpy defaults to ddof=0
features = [
(x - np.expand_dims(x.mean(0), 0)) / np.sqrt(np.expand_dims(x.var(0, ddof=1), 0) + 1e-7)
for x in features
]
# convert into correct format for padding
encoded_inputs = BatchFeature({"input_features": features})
padded_inputs = self.pad(
encoded_inputs,
padding=padding,
max_length=max_length,
truncation=truncation,
pad_to_multiple_of=pad_to_multiple_of,
return_attention_mask=return_attention_mask,
return_tensors="np",
)
# SeamlessM4T needs to process extracted features
input_features = padded_inputs.get("input_features")
attention_mask = padded_inputs.get("attention_mask")
batch_size, num_frames, num_channels = input_features.shape
remainder = num_frames % self.stride
if remainder != 0:
input_features = input_features[:, :num_frames, :]
attention_mask = attention_mask[:, :num_frames]
input_features = np.reshape(
input_features, (batch_size, num_frames // self.stride, num_channels * self.stride)
)
indices = np.arange(0, num_frames)
attention_mask = attention_mask[:, indices % self.stride == 1]
padded_inputs["input_features"] = input_features
padded_inputs["attention_mask"] = attention_mask
if return_tensors is not None:
padded_inputs = padded_inputs.convert_to_tensors(return_tensors)
return padded_inputs
def to_dict(self):
"""
Serializes this instance to a Python dictionary.
Returns:
`Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance.
"""
output = copy.deepcopy(self.__dict__)
output["feature_extractor_type"] = self.__class__.__name__
if "mel_filters" in output:
del output["mel_filters"]
if "window" in output:
del output["window"]
return output
This source diff could not be displayed because it is too large. You can view the blob instead.
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Audio/Text processor class for SeamlessM4T
"""
from ...processing_utils import ProcessorMixin
class SeamlessM4TProcessor(ProcessorMixin):
r"""
Constructs a SeamlessM4T processor which wraps a SeamlessM4T feature extractor and a SeamlessM4T tokenizer into a
single processor.
[`SeamlessM4TProcessor`] offers all the functionalities of [`SeamlessM4TFeatureExtractor`] and
[`SeamlessM4TTokenizerFast`]. See the [`~SeamlessM4TProcessor.__call__`] and [`~SeamlessM4TProcessor.decode`] for
more information.
Args:
feature_extractor ([`SeamlessM4TFeatureExtractor`]):
The audio processor is a required input.
tokenizer ([`SeamlessM4TTokenizerFast`]):
The tokenizer is a required input.
"""
feature_extractor_class = "SeamlessM4TFeatureExtractor"
tokenizer_class = ("SeamlessM4TTokenizer", "SeamlessM4TTokenizerFast")
def __init__(self, feature_extractor, tokenizer):
super().__init__(feature_extractor, tokenizer)
def __call__(self, text=None, audios=None, src_lang=None, tgt_lang=None, **kwargs):
"""
Main method to prepare for the model one or several sequences(s) and audio(s). This method forwards the `text`
and `kwargs` arguments to SeamlessM4TTokenizerFast's [`~SeamlessM4TTokenizerFast.__call__`] if `text` is not
`None` to encode the text. To prepare the audio(s), this method forwards the `audios` and `kwrags` arguments to
SeamlessM4TFeatureExtractor's [`~SeamlessM4TFeatureExtractor.__call__`] if `audios` is not `None`. Please refer
to the doctsring of the above two methods for more information.
Args:
text (`str`, `List[str]`, `List[List[str]]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
audios (`np.ndarray`, `torch.Tensor`, `List[np.ndarray]`, `List[torch.Tensor]`):
The audio or batch of audios to be prepared. Each audio can be NumPy array or PyTorch tensor. In case
of a NumPy array/PyTorch tensor, each audio should be of shape (C, T), where C is a number of channels,
and T the sample length of the audio.
src_lang (`str`, *optional*):
The language code of the input texts/audios. If not specified, the last `src_lang` specified will be
used.
tgt_lang (`str`, *optional*):
The code of the target language. If not specified, the last `tgt_lang` specified will be used.
kwargs (*optional*):
Remaining dictionary of keyword arguments that will be passed to the feature extractor and/or the
tokenizer.
Returns:
[`BatchEncoding`]: A [`BatchEncoding`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
`None`).
- **input_features** -- Audio input features to be fed to a model. Returned when `audios` is not `None`.
"""
sampling_rate = kwargs.pop("sampling_rate", None)
if text is None and audios is None:
raise ValueError("You have to specify either text or audios. Both cannot be none.")
elif text is not None and audios is not None:
raise ValueError(
"Text and audios are mututally exclusive when passed to `SeamlessM4T`. Specify one or another."
)
elif text is not None:
if tgt_lang is not None:
self.tokenizer.tgt_lang = tgt_lang
if src_lang is not None:
self.tokenizer.src_lang = src_lang
encoding = self.tokenizer(text, **kwargs)
return encoding
else:
encoding = self.feature_extractor(audios, sampling_rate=sampling_rate, **kwargs)
return encoding
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to SeamlessM4TTokenizerFast's [`~PreTrainedTokenizer.batch_decode`].
Please refer to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to SeamlessM4TTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please
refer to the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)
@property
def model_input_names(self):
tokenizer_input_names = self.tokenizer.model_input_names
feature_extractor_input_names = self.feature_extractor.model_input_names
return list(dict.fromkeys(tokenizer_input_names + feature_extractor_input_names))
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes for SeamlessM4T."""
import os
from shutil import copyfile
from typing import Any, Dict, List, Optional, Tuple, Union
import sentencepiece as spm
from ...convert_slow_tokenizer import import_protobuf
from ...tokenization_utils import (
BatchEncoding,
PreTokenizedInput,
PreTrainedTokenizer,
TextInput,
)
from ...tokenization_utils_base import AddedToken
from ...utils import PaddingStrategy, logging
logger = logging.get_logger(__name__)
PRETRAINED_VOCAB_FILES_MAP = {
"vocab_file": {
"facebook/hf-seamless-m4t-medium": (
"https://huggingface.co/facebook/hf-seamless-m4t-medium/blob/main/sentencepiece.bpe.model"
),
}
}
SPIECE_UNDERLINE = "▁"
VOCAB_FILES_NAMES = {"vocab_file": "sentencepiece.bpe.model"}
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
"facebook/hf-seamless-m4t-medium": 2048,
}
class SeamlessM4TTokenizer(PreTrainedTokenizer):
"""
Construct a SeamlessM4T tokenizer.
Adapted from [`RobertaTokenizer`] and [`XLNetTokenizer`]. Based on
[SentencePiece](https://github.com/google/sentencepiece).
The tokenization method is `<language code> <tokens> <eos>` for source language documents, and `<eos> <language
code> <tokens> <eos>` for target language documents.
Examples:
```python
>>> from transformers import SeamlessM4TTokenizer
>>> tokenizer = SeamlessM4TTokenizer.from_pretrained(
... "facebook/hf-seamless-m4t-medium", src_lang="eng", tgt_lang="fra"
... )
>>> example_english_phrase = " UN Chief Says There Is No Military Solution in Syria"
>>> expected_translation_french = "Le chef de l'ONU affirme qu'il n'y a pas de solution militaire en Syrie."
>>> inputs = tokenizer(example_english_phrase, text_target=expected_translation_french, return_tensors="pt")
```
Args:
vocab_file (`str`):
Path to the vocabulary file.
bos_token (`str`, *optional*, defaults to `"<s>"`):
The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
<Tip>
When building a sequence using special tokens, this is not the token that is used for the beginning of
sequence. The token used is the `cls_token`.
</Tip>
eos_token (`str`, *optional*, defaults to `"</s>"`):
The end of sequence token.
<Tip>
When building a sequence using special tokens, this is not the token that is used for the end of sequence.
The token used is the `sep_token`.
</Tip>
sep_token (`str`, *optional*, defaults to `"</s>"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
cls_token (`str`, *optional*, defaults to `"<s>"`):
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens.
unk_token (`str`, *optional*, defaults to `"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
pad_token (`str`, *optional*, defaults to `"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
tokenizer_file (`str`, *optional*):
The path to a tokenizer file to use instead of the vocab file.
src_lang (`str`, *optional*, defaults to `"eng"`):
The language to use as source language for translation.
tgt_lang (`str`, *optional*, defaults to `"fra"`):
The language to use as target language for translation.
sp_model_kwargs (`Dict[str, Any]`, *optional*):
Additional keyword arguments to pass to the model initialization.
additional_special_tokens (tuple or list of `str` or `tokenizers.AddedToken`, *optional*):
A tuple or a list of additional special tokens. Can be used to specify the list of languages that will be
supported by the tokenizer.
"""
vocab_files_names = VOCAB_FILES_NAMES
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
model_input_names = ["input_ids", "attention_mask"]
prefix_tokens: List[int] = []
suffix_tokens: List[int] = []
def __init__(
self,
vocab_file,
bos_token="<s>",
eos_token="</s>",
sep_token="</s>",
cls_token="<s>",
unk_token="<unk>",
pad_token="<pad>",
tokenizer_file=None,
src_lang="eng",
tgt_lang="fra",
sp_model_kwargs: Optional[Dict[str, Any]] = None,
additional_special_tokens=None,
**kwargs,
):
self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
# Add this unused argument to keep some important Copied from statements
self.legacy = False
self.vocab_file = vocab_file
self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
# Vocab | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
# -------- | ------- | ------- | ------ | ------- | ---- | ---- | ---- | ---- | ---- | ----
# spm | '<unk>' | '<s>' | '</s>' | 'an' | 'en' | '_d' | 'er' | 'in' | '_s' | '_a'
# fairseq | '<pad>' | '<unk>' | '<s>' | '</s>' | 'an' | 'en' | '▁d' | 'er' | 'in' | '▁s'
# Mimic fairseq token-to-id alignment for the first 4 token
self._added_tokens_decoder = {
0: AddedToken(pad_token, special=True) if isinstance(pad_token, str) else pad_token,
1: AddedToken(unk_token, special=True) if isinstance(unk_token, str) else unk_token,
2: AddedToken(bos_token, special=True) if isinstance(bos_token, str) else bos_token,
3: AddedToken(eos_token, special=True) if isinstance(eos_token, str) else eos_token,
}
# The first "real" token "an" has position 4 in the original fairseq vocab and position 3 in the spm vocab
self.fairseq_offset = 1
self.sp_model_size = len(self.sp_model)
self._src_lang = f"__{src_lang}__" if "__" not in src_lang else src_lang
self._tgt_lang = f"__{tgt_lang}__" if "__" not in tgt_lang else tgt_lang
super().__init__(
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
cls_token=cls_token,
pad_token=pad_token,
tokenizer_file=tokenizer_file,
src_lang=src_lang,
tgt_lang=tgt_lang,
additional_special_tokens=additional_special_tokens,
sp_model_kwargs=self.sp_model_kwargs,
**kwargs,
)
self.set_src_lang_special_tokens(self._src_lang)
self.set_tgt_lang_special_tokens(self._tgt_lang)
# Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer.__getstate__
def __getstate__(self):
state = self.__dict__.copy()
state["sp_model"] = None
state["sp_model_proto"] = self.sp_model.serialized_model_proto()
return state
# Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer.__setstate__
def __setstate__(self, d):
self.__dict__ = d
# for backward compatibility
if not hasattr(self, "sp_model_kwargs"):
self.sp_model_kwargs = {}
self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
self.sp_model.LoadFromSerializedProto(self.sp_model_proto)
@property
def vocab_size(self):
return len(self.sp_model)
def __call__(
self,
text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
text_pair: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
text_target: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
text_pair_target: Optional[
Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]
] = None,
padding: Union[bool, str, PaddingStrategy] = True,
pad_to_multiple_of: Optional[int] = 2,
src_lang: Optional[str] = None,
tgt_lang: Optional[str] = None,
**kwargs,
):
"""
Args:
text (`str`, `List[str]`, `List[List[str]]`, *optional*):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
text_pair (`str`, `List[str]`, `List[List[str]]`, *optional*):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
text_target (`str`, `List[str]`, `List[List[str]]`, *optional*):
The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a
list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
text_pair_target (`str`, `List[str]`, `List[List[str]]`, *optional*):
The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a
list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
Select a strategy to pad the returned sequences (according to the model's padding side and padding
index) among:
- `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
sequence if provided).
- `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
acceptable input length for the model if that argument is not provided.
- `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
lengths).
pad_to_multiple_of (`int`, *optional*):
If set will pad the sequence to a multiple of the provided value.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
`>= 7.5` (Volta).
src_lang (`str`, *optional*):
A string representing the source language. If not specified, the last `src_lang` specified (either
during initialization or when calling this tokenizer) will be used.
tgt_lang (`str`, *optional*):
A string representing the target language. If not specified, the last `tgt_lang` specified (either
during initialization or when calling this tokenizer) will be used.
kwargs (*optional*):
Remaining dictionary of keyword arguments that will be passed to [`PreTrainedTokenizer.__call__`].
"""
if src_lang is not None:
self.src_lang = src_lang
if tgt_lang is not None:
self.tgt_lang = tgt_lang
output = super().__call__(
text=text,
text_pair=text_pair,
text_target=text_target,
text_pair_target=text_pair_target,
padding=padding,
pad_to_multiple_of=pad_to_multiple_of,
**kwargs,
)
return BatchEncoding(output, tensor_type=kwargs.get("return_tensors"))
@property
# Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer.src_lang
def src_lang(self) -> str:
return self._src_lang
@src_lang.setter
def src_lang(self, new_src_lang: str) -> None:
if "__" not in new_src_lang:
self._src_lang = f"__{new_src_lang}__"
else:
self._src_lang = new_src_lang
self.set_src_lang_special_tokens(self._src_lang)
@property
def tgt_lang(self) -> str:
return self._tgt_lang
@tgt_lang.setter
def tgt_lang(self, new_tgt_lang: str) -> None:
if "__" not in new_tgt_lang:
self._tgt_lang = f"__{new_tgt_lang}__"
else:
self._tgt_lang = new_tgt_lang
self.set_tgt_lang_special_tokens(self._tgt_lang)
# Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer.get_special_tokens_mask
def get_special_tokens_mask(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer `prepare_for_model` method.
Args:
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not the token list is already formatted with special tokens for the model.
Returns:
`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
return super().get_special_tokens_mask(
token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
)
prefix_ones = [1] * len(self.prefix_tokens)
suffix_ones = [1] * len(self.suffix_tokens)
if token_ids_1 is None:
return prefix_ones + ([0] * len(token_ids_0)) + suffix_ones
return prefix_ones + ([0] * len(token_ids_0)) + ([0] * len(token_ids_1)) + suffix_ones
# Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer.build_inputs_with_special_tokens
def build_inputs_with_special_tokens(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens. An NLLB sequence has the following format, where `X` represents the sequence:
- `input_ids` (for encoder) `X [eos, src_lang_code]`
- `decoder_input_ids`: (for decoder) `X [eos, tgt_lang_code]`
BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a
separator.
Args:
token_ids_0 (`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
"""
if token_ids_1 is None:
return self.prefix_tokens + token_ids_0 + self.suffix_tokens
# We don't expect to process pairs, but leave the pair logic for API consistency
return self.prefix_tokens + token_ids_0 + token_ids_1 + self.suffix_tokens
# Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer.create_token_type_ids_from_sequences
def create_token_type_ids_from_sequences(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Create a mask from the two sequences passed to be used in a sequence-pair classification task. nllb does not
make use of token type ids, therefore a list of zeros is returned.
Args:
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of zeros.
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
if token_ids_1 is None:
return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]
def _build_translation_inputs(
self, raw_inputs, return_tensors: str, src_lang: Optional[str], tgt_lang: Optional[str], **extra_kwargs
):
"""Used by translation pipeline, to prepare inputs for the generate function"""
if src_lang is None or tgt_lang is None:
raise ValueError("Translation requires a `src_lang` and a `tgt_lang` for this model.")
self.src_lang = src_lang
inputs = self(raw_inputs, add_special_tokens=True, return_tensors=return_tensors, **extra_kwargs)
if "__" not in tgt_lang:
tgt_lang = f"__{tgt_lang}__"
tgt_lang_id = self.convert_tokens_to_ids(tgt_lang)
inputs["forced_bos_token_id"] = tgt_lang_id
return inputs
def get_vocab(self):
vocab = {
self.convert_ids_to_tokens(i): i for i in range(self.fairseq_offset, self.vocab_size + self.fairseq_offset)
}
vocab.update(self.added_tokens_encoder)
return vocab
@property
def unk_token_length(self):
return len(self.sp_model.encode(str(self.unk_token)))
# Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.get_spm_processor
def get_spm_processor(self, from_slow=False):
tokenizer = spm.SentencePieceProcessor(**self.sp_model_kwargs)
if self.legacy or from_slow: # no dependency on protobuf
tokenizer.Load(self.vocab_file)
return tokenizer
with open(self.vocab_file, "rb") as f:
sp_model = f.read()
model_pb2 = import_protobuf(f"The new behaviour of {self.__class__.__name__} (with `self.legacy = False`)")
model = model_pb2.ModelProto.FromString(sp_model)
normalizer_spec = model_pb2.NormalizerSpec()
normalizer_spec.add_dummy_prefix = False
model.normalizer_spec.MergeFrom(normalizer_spec)
sp_model = model.SerializeToString()
tokenizer.LoadFromSerializedProto(sp_model)
return tokenizer
# Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.tokenize
def tokenize(self, text: "TextInput", add_special_tokens=False, **kwargs) -> List[str]:
"""
Converts a string to a list of tokens. If `self.legacy` is set to `False`, a prefix token is added unless the
first token is special.
"""
if self.legacy or len(text) == 0:
return super().tokenize(text, **kwargs)
tokens = super().tokenize(SPIECE_UNDERLINE + text.replace(SPIECE_UNDERLINE, " "), **kwargs)
if len(tokens) > 1 and tokens[0] == SPIECE_UNDERLINE and tokens[1] in self.all_special_tokens:
tokens = tokens[1:]
return tokens
# Copied from transformers.models.t5.tokenization_t5.T5Tokenizer._tokenize
def _tokenize(self, text, **kwargs):
"""
Returns a tokenized string.
We de-activated the `add_dummy_prefix` option, thus the sentencepiece internals will always strip any
SPIECE_UNDERLINE. For example: `self.sp_model.encode(f"{SPIECE_UNDERLINE}Hey", out_type = str)` will give
`['H', 'e', 'y']` instead of `['▁He', 'y']`. Thus we always encode `f"{unk_token}text"` and strip the
`unk_token`. Here is an example with `unk_token = "<unk>"` and `unk_token_length = 4`.
`self.tokenizer.sp_model.encode("<unk> Hey", out_type = str)[4:]`.
"""
tokens = self.sp_model.encode(text, out_type=str)
if self.legacy or not text.startswith((SPIECE_UNDERLINE, " ")):
return tokens
# 1. Encode string + prefix ex: "<unk> Hey"
tokens = self.sp_model.encode(self.unk_token + text, out_type=str)
# 2. Remove self.unk_token from ['<','unk','>', '▁Hey']
return tokens[self.unk_token_length :] if len(tokens) >= self.unk_token_length else tokens
def _convert_token_to_id(self, token):
"""Converts a token (str) in an id using the vocab."""
spm_id = self.sp_model.PieceToId(token)
# Need to return unknown token if the SP model returned 0
return spm_id + self.fairseq_offset if spm_id else self.unk_token_id
def _convert_id_to_token(self, index):
"""Converts an index (integer) in a token (str) using the vocab."""
return self.sp_model.IdToPiece(index - self.fairseq_offset)
def convert_tokens_to_string(self, tokens):
"""Converts a sequence of tokens (strings for sub-words) in a single string."""
if tokens[0].startswith(SPIECE_UNDERLINE):
tokens[0] = tokens[0][1:]
out_string = "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()
return out_string
# Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer.save_vocabulary
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
if not os.path.isdir(save_directory):
logger.error(f"Vocabulary path ({save_directory}) should be a directory")
return
out_vocab_file = os.path.join(
save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
)
if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
copyfile(self.vocab_file, out_vocab_file)
elif not os.path.isfile(self.vocab_file):
with open(out_vocab_file, "wb") as fi:
content_spiece_model = self.sp_model.serialized_model_proto()
fi.write(content_spiece_model)
return (out_vocab_file,)
# Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer.prepare_seq2seq_batch with eng_Latn->eng, fra_Latn->fra
def prepare_seq2seq_batch(
self,
src_texts: List[str],
src_lang: str = "eng",
tgt_texts: Optional[List[str]] = None,
tgt_lang: str = "fra",
**kwargs,
) -> BatchEncoding:
self.src_lang = src_lang
self.tgt_lang = tgt_lang
return super().prepare_seq2seq_batch(src_texts, tgt_texts, **kwargs)
# Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer._switch_to_input_mode
def _switch_to_input_mode(self):
return self.set_src_lang_special_tokens(self.src_lang)
# Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer._switch_to_target_mode
def _switch_to_target_mode(self):
return self.set_tgt_lang_special_tokens(self.tgt_lang)
def set_src_lang_special_tokens(self, src_lang) -> None:
"""Reset the special tokens to the source lang setting.
Prefix=[src_lang_code], suffix = [eos]
"""
self.cur_lang_code = self.convert_tokens_to_ids(src_lang)
self.init_kwargs["src_lang"] = src_lang
if self.cur_lang_code == self.unk_token_id:
logger.warning_once(
f"`src_lang={src_lang}` has not be found in the vocabulary. Behaviour will probably be unexpected because the language token id will be replaced by the unknown token id."
)
self.prefix_tokens = [self.cur_lang_code]
self.suffix_tokens = [self.eos_token_id]
# https://github.com/facebookresearch/fairseq2/blob/c53f18e6be6b8b46b722f2249b8397b7eccd7ad3/src/fairseq2/models/nllb/tokenizer.py#L112-L116
def set_tgt_lang_special_tokens(self, lang: str) -> None:
"""Reset the special tokens to the target lang setting.
Prefix=[eos, tgt_lang_code] and suffix=[eos].
"""
self.cur_lang_code = self.convert_tokens_to_ids(lang)
self.init_kwargs["tgt_lang"] = lang
if self.cur_lang_code == self.unk_token_id:
logger.warning_once(
f"`tgt_lang={lang}` has not be found in the vocabulary. Behaviour will probably be unexpected because the language token id will be replaced by the unknown token id."
)
self.prefix_tokens = [self.eos_token_id, self.cur_lang_code]
self.suffix_tokens = [self.eos_token_id]
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Fast Tokenization class for SeamlessM4T."""
import os
from shutil import copyfile
from typing import List, Optional, Tuple, Union
from tokenizers import processors
from ...tokenization_utils import (
BatchEncoding,
PreTokenizedInput,
TextInput,
)
from ...tokenization_utils_fast import PreTrainedTokenizerFast
from ...utils import PaddingStrategy, logging
from .tokenization_seamless_m4t import (
SeamlessM4TTokenizer,
)
logger = logging.get_logger(__name__)
VOCAB_FILES_NAMES = {"vocab_file": "sentencepiece.bpe.model", "tokenizer_file": "tokenizer.json"}
PRETRAINED_VOCAB_FILES_MAP = {
"vocab_file": {
"facebook/hf-seamless-m4t-medium": "https://huggingface.co/facebook/hf-seamless-m4t-medium/resolve/main/vocab.txt",
},
"tokenizer_file": {
"facebook/hf-seamless-m4t-medium": "https://huggingface.co/facebook/hf-seamless-m4t-medium/resolve/main/tokenizer.json",
},
}
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
"facebook/hf-seamless-m4t-medium": 2048,
}
class SeamlessM4TTokenizerFast(PreTrainedTokenizerFast):
"""
Construct a "fast" SeamlessM4T tokenizer (backed by HuggingFace's *tokenizers* library). Based on
[BPE](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=BPE#models).
This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
refer to this superclass for more information regarding those methods.
The tokenization method is `<language code> <tokens> <eos>` for source language documents, and `<eos> <language
code> <tokens> <eos>` for target language documents.
Examples:
```python
>>> from transformers import SeamlessM4TTokenizerFast
>>> tokenizer = SeamlessM4TTokenizerFast.from_pretrained(
... "facebook/hf-seamless-m4t-medium", src_lang="eng", tgt_lang="fra"
... )
>>> example_english_phrase = " UN Chief Says There Is No Military Solution in Syria"
>>> expected_translation_french = "Le chef de l'ONU affirme qu'il n'y a pas de solution militaire en Syrie."
>>> inputs = tokenizer(example_english_phrase, text_target=expected_translation_french, return_tensors="pt")
```
Args:
vocab_file (`str`, *optional*):
Path to the vocabulary file.
tokenizer_file (`str`, *optional*):
The path to a tokenizer file to use instead of the vocab file.
bos_token (`str`, *optional*, defaults to `"<s>"`):
The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
<Tip>
When building a sequence using special tokens, this is not the token that is used for the beginning of
sequence. The token used is the `cls_token`.
</Tip>
eos_token (`str`, *optional*, defaults to `"</s>"`):
The end of sequence token.
<Tip>
When building a sequence using special tokens, this is not the token that is used for the end of sequence.
The token used is the `sep_token`.
</Tip>
sep_token (`str`, *optional*, defaults to `"</s>"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
cls_token (`str`, *optional*, defaults to `"<s>"`):
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens.
unk_token (`str`, *optional*, defaults to `"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
pad_token (`str`, *optional*, defaults to `"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
src_lang (`str`, *optional*, defaults to `"eng"`):
The language to use as source language for translation.
tgt_lang (`str`, *optional*, defaults to `"fra"`):
The language to use as target language for translation.
additional_special_tokens (tuple or list of `str` or `tokenizers.AddedToken`, *optional*):
A tuple or a list of additional special tokens.
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
slow_tokenizer_class = SeamlessM4TTokenizer
model_input_names = ["input_ids", "attention_mask"]
prefix_tokens: List[int] = []
suffix_tokens: List[int] = []
def __init__(
self,
vocab_file=None,
tokenizer_file=None,
bos_token="<s>",
eos_token="</s>",
sep_token="</s>",
cls_token="<s>",
unk_token="<unk>",
pad_token="<pad>",
src_lang="eng",
tgt_lang="fra",
additional_special_tokens=None,
**kwargs,
):
super().__init__(
vocab_file=vocab_file,
tokenizer_file=tokenizer_file,
bos_token=bos_token,
eos_token=eos_token,
sep_token=sep_token,
cls_token=cls_token,
unk_token=unk_token,
pad_token=pad_token,
src_lang=src_lang,
tgt_lang=tgt_lang,
additional_special_tokens=additional_special_tokens,
**kwargs,
)
self.vocab_file = vocab_file
self._src_lang = f"__{src_lang}__" if "__" not in src_lang else src_lang
self._tgt_lang = f"__{tgt_lang}__" if "__" not in tgt_lang else tgt_lang
self.set_src_lang_special_tokens(self._src_lang)
self.set_tgt_lang_special_tokens(self._tgt_lang)
@property
def can_save_slow_tokenizer(self) -> bool:
return os.path.isfile(self.vocab_file) if self.vocab_file else False
@property
# Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer.src_lang
def src_lang(self) -> str:
return self._src_lang
@src_lang.setter
def src_lang(self, new_src_lang: str) -> None:
if "__" not in new_src_lang:
self._src_lang = f"__{new_src_lang}__"
else:
self._src_lang = new_src_lang
self.set_src_lang_special_tokens(self._src_lang)
@property
def tgt_lang(self) -> str:
return self._tgt_lang
@tgt_lang.setter
def tgt_lang(self, new_tgt_lang: str) -> None:
if "__" not in new_tgt_lang:
self._tgt_lang = f"__{new_tgt_lang}__"
else:
self._tgt_lang = new_tgt_lang
self.set_tgt_lang_special_tokens(self._tgt_lang)
def build_inputs_with_special_tokens(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens. The special tokens depend on calling set_lang.
An SeamlessM4T sequence has the following format, where `X` represents the sequence:
- `input_ids` (for encoder) `[src_lang_code] X [eos]`
- `decoder_input_ids`: (for decoder) `[eos, tgt_lang_code] X [eos]`
BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a
separator.
Args:
token_ids_0 (`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: list of [input IDs](../glossary#input-ids) with the appropriate special tokens.
"""
if token_ids_1 is None:
return self.prefix_tokens + token_ids_0 + self.suffix_tokens
# We don't expect to process pairs, but leave the pair logic for API consistency
return self.prefix_tokens + token_ids_0 + token_ids_1 + self.suffix_tokens
# Copied from transformers.models.nllb.tokenization_nllb_fast.NllbTokenizerFast.create_token_type_ids_from_sequences
def create_token_type_ids_from_sequences(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Create a mask from the two sequences passed to be used in a sequence-pair classification task. nllb does not
make use of token type ids, therefore a list of zeros is returned.
Args:
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of zeros.
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
if token_ids_1 is None:
return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]
def _build_translation_inputs(
self, raw_inputs, return_tensors: str, src_lang: Optional[str], tgt_lang: Optional[str], **extra_kwargs
):
"""Used by translation pipeline, to prepare inputs for the generate function"""
if src_lang is None or tgt_lang is None:
raise ValueError("Translation requires a `src_lang` and a `tgt_lang` for this model")
self.src_lang = src_lang
inputs = self(raw_inputs, add_special_tokens=True, return_tensors=return_tensors, **extra_kwargs)
if "__" not in tgt_lang:
tgt_lang = f"__{tgt_lang}__"
tgt_lang_id = self.convert_tokens_to_ids(tgt_lang)
inputs["forced_bos_token_id"] = tgt_lang_id
return inputs
# Copied from transformers.models.nllb.tokenization_nllb_fast.NllbTokenizerFast.prepare_seq2seq_batch with "fra_Latn"->"fra", "eng_Latn"->"eng"
def prepare_seq2seq_batch(
self,
src_texts: List[str],
src_lang: str = "eng",
tgt_texts: Optional[List[str]] = None,
tgt_lang: str = "fra",
**kwargs,
) -> BatchEncoding:
self.src_lang = src_lang
self.tgt_lang = tgt_lang
return super().prepare_seq2seq_batch(src_texts, tgt_texts, **kwargs)
# Copied from transformers.models.nllb.tokenization_nllb_fast.NllbTokenizerFast._switch_to_input_mode
def _switch_to_input_mode(self):
return self.set_src_lang_special_tokens(self.src_lang)
# Copied from transformers.models.nllb.tokenization_nllb_fast.NllbTokenizerFast._switch_to_target_mode
def _switch_to_target_mode(self):
return self.set_tgt_lang_special_tokens(self.tgt_lang)
def set_src_lang_special_tokens(self, src_lang) -> None:
"""Reset the special tokens to the source lang setting.
Prefix=[src_lang_code], suffix = [eos]
"""
self.cur_lang_code = self.convert_tokens_to_ids(src_lang)
if self.cur_lang_code == self.unk_token_id:
logger.warning_once(
f"`tgt_lang={src_lang}` has not be found in the `vocabulary`. Behaviour will probably be unexpected because the language token id will be replaced by the unknown token id."
)
self.init_kwargs["src_lang"] = src_lang
self.prefix_tokens = [self.cur_lang_code]
self.suffix_tokens = [self.eos_token_id]
prefix_tokens_str = self.convert_ids_to_tokens(self.prefix_tokens)
suffix_tokens_str = self.convert_ids_to_tokens(self.suffix_tokens)
self._tokenizer.post_processor = processors.TemplateProcessing(
single=prefix_tokens_str + ["$A"] + suffix_tokens_str,
pair=prefix_tokens_str + ["$A", "$B"] + suffix_tokens_str,
special_tokens=list(zip(prefix_tokens_str + suffix_tokens_str, self.prefix_tokens + self.suffix_tokens)),
)
def set_tgt_lang_special_tokens(self, lang: str) -> None:
"""Reset the special tokens to the target lang setting.
Prefix=[eos, tgt_lang_code] and suffix=[eos].
"""
self.cur_lang_code = self.convert_tokens_to_ids(lang)
if self.cur_lang_code == self.unk_token_id:
logger.warning_once(
f"`tgt_lang={lang}` has not be found in the `vocabulary`. Behaviour will probably be unexpected because the language token id will be replaced by the unknown token id."
)
self.init_kwargs["tgt_lang"] = lang
self.prefix_tokens = [self.eos_token_id, self.cur_lang_code]
self.suffix_tokens = [self.eos_token_id]
prefix_tokens_str = self.convert_ids_to_tokens(self.prefix_tokens)
suffix_tokens_str = self.convert_ids_to_tokens(self.suffix_tokens)
self._tokenizer.post_processor = processors.TemplateProcessing(
single=prefix_tokens_str + ["$A"] + suffix_tokens_str,
pair=prefix_tokens_str + ["$A", "$B"] + suffix_tokens_str,
special_tokens=list(zip(prefix_tokens_str + suffix_tokens_str, self.prefix_tokens + self.suffix_tokens)),
)
# Copied from transformers.models.nllb.tokenization_nllb_fast.NllbTokenizerFast.save_vocabulary
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
if not self.can_save_slow_tokenizer:
raise ValueError(
"Your fast tokenizer does not have the necessary information to save the vocabulary for a slow "
"tokenizer."
)
if not os.path.isdir(save_directory):
logger.error(f"Vocabulary path ({save_directory}) should be a directory.")
return
out_vocab_file = os.path.join(
save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
)
if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
copyfile(self.vocab_file, out_vocab_file)
return (out_vocab_file,)
@classmethod
def _from_pretrained(
cls,
resolved_vocab_files,
pretrained_model_name_or_path,
init_configuration,
*init_inputs,
token=None,
cache_dir=None,
local_files_only=False,
_commit_hash=None,
_is_local=False,
**kwargs,
):
tokenizer = super()._from_pretrained(
resolved_vocab_files,
pretrained_model_name_or_path,
init_configuration,
*init_inputs,
token=token,
cache_dir=cache_dir,
local_files_only=local_files_only,
_commit_hash=_commit_hash,
_is_local=_is_local,
**kwargs,
)
# ensure also set after from pretrained
tokenizer.set_src_lang_special_tokens(tokenizer._src_lang)
tokenizer.set_tgt_lang_special_tokens(tokenizer._tgt_lang)
return tokenizer
def __call__(
self,
text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
text_pair: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
text_target: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
text_pair_target: Optional[
Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]
] = None,
padding: Union[bool, str, PaddingStrategy] = True,
pad_to_multiple_of: Optional[int] = 2,
src_lang: Optional[str] = None,
tgt_lang: Optional[str] = None,
**kwargs,
):
"""
Args:
text (`str`, `List[str]`, `List[List[str]]`, *optional*):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
text_pair (`str`, `List[str]`, `List[List[str]]`, *optional*):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
text_target (`str`, `List[str]`, `List[List[str]]`, *optional*):
The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a
list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
text_pair_target (`str`, `List[str]`, `List[List[str]]`, *optional*):
The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a
list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
Select a strategy to pad the returned sequences (according to the model's padding side and padding
index) among:
- `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
sequence if provided).
- `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
acceptable input length for the model if that argument is not provided.
- `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
lengths).
pad_to_multiple_of (`int`, *optional*):
If set will pad the sequence to a multiple of the provided value.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
`>= 7.5` (Volta).
src_lang (`str`, *optional*):
A string representing the source language. If not specified, the last `src_lang` specified (either
during initialization or when calling this tokenizer) will be used.
tgt_lang (`str`, *optional*):
A string representing the target language. If not specified, the last `tgt_lang` specified (either
during initialization or when calling this tokenizer) will be used.
kwargs (*optional*):
Remaining dictionary of keyword arguments that will be passed to [`PreTrainedTokenizerFast.__call__`].
"""
if src_lang is not None:
self.src_lang = src_lang
if tgt_lang is not None:
self.tgt_lang = tgt_lang
output = super().__call__(
text=text,
text_pair=text_pair,
text_target=text_target,
text_pair_target=text_pair_target,
padding=padding,
pad_to_multiple_of=pad_to_multiple_of,
**kwargs,
)
return output
...@@ -6933,6 +6933,79 @@ class SamPreTrainedModel(metaclass=DummyObject): ...@@ -6933,6 +6933,79 @@ class SamPreTrainedModel(metaclass=DummyObject):
requires_backends(self, ["torch"]) requires_backends(self, ["torch"])
SEAMLESS_M4T_PRETRAINED_MODEL_ARCHIVE_LIST = None
class SeamlessM4TCodeHifiGan(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class SeamlessM4TForSpeechToSpeech(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class SeamlessM4TForSpeechToText(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class SeamlessM4TForTextToSpeech(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class SeamlessM4TForTextToText(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class SeamlessM4THifiGan(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class SeamlessM4TModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class SeamlessM4TPreTrainedModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class SeamlessM4TTextToUnitForConditionalGeneration(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class SeamlessM4TTextToUnitModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
SEGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = None SEGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = None
......
...@@ -177,6 +177,13 @@ class RemBertTokenizer(metaclass=DummyObject): ...@@ -177,6 +177,13 @@ class RemBertTokenizer(metaclass=DummyObject):
requires_backends(self, ["sentencepiece"]) requires_backends(self, ["sentencepiece"])
class SeamlessM4TTokenizer(metaclass=DummyObject):
_backends = ["sentencepiece"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["sentencepiece"])
class Speech2TextTokenizer(metaclass=DummyObject): class Speech2TextTokenizer(metaclass=DummyObject):
_backends = ["sentencepiece"] _backends = ["sentencepiece"]
......
...@@ -366,6 +366,13 @@ class RoFormerTokenizerFast(metaclass=DummyObject): ...@@ -366,6 +366,13 @@ class RoFormerTokenizerFast(metaclass=DummyObject):
requires_backends(self, ["tokenizers"]) requires_backends(self, ["tokenizers"])
class SeamlessM4TTokenizerFast(metaclass=DummyObject):
_backends = ["tokenizers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["tokenizers"])
class SplinterTokenizerFast(metaclass=DummyObject): class SplinterTokenizerFast(metaclass=DummyObject):
_backends = ["tokenizers"] _backends = ["tokenizers"]
......
...@@ -1588,7 +1588,7 @@ class GenerationTesterMixin: ...@@ -1588,7 +1588,7 @@ class GenerationTesterMixin:
# may fix in the future: the following models fail with assisted decoding, and need model-specific fixes # may fix in the future: the following models fail with assisted decoding, and need model-specific fixes
if any( if any(
model_name in model_class.__name__.lower() model_name in model_class.__name__.lower()
for model_name in ["bigbirdpegasus", "led", "mega", "speech2text", "git", "prophetnet"] for model_name in ["bigbirdpegasus", "led", "mega", "speech2text", "git", "prophetnet", "seamlessm4t"]
): ):
return return
......
# coding=utf-8
# Copyright 2023 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import itertools
import os
import random
import tempfile
import unittest
import numpy as np
from datasets import load_dataset
from transformers import SeamlessM4TFeatureExtractor, is_speech_available
from transformers.testing_utils import check_json_file_has_correct_format, require_torch
from transformers.utils.import_utils import is_torch_available
from ...test_sequence_feature_extraction_common import SequenceFeatureExtractionTestMixin
if is_torch_available():
import torch
global_rng = random.Random()
# Copied from tests.models.whisper.test_feature_extraction_whisper.floats_list
def floats_list(shape, scale=1.0, rng=None, name=None):
"""Creates a random float32 tensor"""
if rng is None:
rng = global_rng
values = []
for batch_idx in range(shape[0]):
values.append([])
for _ in range(shape[1]):
values[-1].append(rng.random() * scale)
return values
@require_torch
class SeamlessM4TFeatureExtractionTester(unittest.TestCase):
def __init__(
self,
parent,
batch_size=7,
min_seq_length=400,
max_seq_length=2000,
feature_size=10,
padding_value=0.0,
sampling_rate=4_000,
return_attention_mask=True,
do_normalize=True,
stride=2,
):
self.parent = parent
self.batch_size = batch_size
self.min_seq_length = min_seq_length
self.max_seq_length = max_seq_length
self.seq_length_diff = (self.max_seq_length - self.min_seq_length) // (self.batch_size - 1)
self.padding_value = padding_value
self.sampling_rate = sampling_rate
self.return_attention_mask = return_attention_mask
self.do_normalize = do_normalize
self.feature_size = feature_size
self.stride = stride
self.num_mel_bins = feature_size
def prepare_feat_extract_dict(self):
return {
"feature_size": self.feature_size,
"num_mel_bins": self.num_mel_bins,
"padding_value": self.padding_value,
"sampling_rate": self.sampling_rate,
"stride": self.stride,
"return_attention_mask": self.return_attention_mask,
"do_normalize": self.do_normalize,
}
# Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTester.prepare_inputs_for_common
def prepare_inputs_for_common(self, equal_length=False, numpify=False):
def _flatten(list_of_lists):
return list(itertools.chain(*list_of_lists))
if equal_length:
speech_inputs = [floats_list((self.max_seq_length, self.feature_size)) for _ in range(self.batch_size)]
else:
# make sure that inputs increase in size
speech_inputs = [
floats_list((x, self.feature_size))
for x in range(self.min_seq_length, self.max_seq_length, self.seq_length_diff)
]
if numpify:
speech_inputs = [np.asarray(x) for x in speech_inputs]
return speech_inputs
@require_torch
class SeamlessM4TFeatureExtractionTest(SequenceFeatureExtractionTestMixin, unittest.TestCase):
feature_extraction_class = SeamlessM4TFeatureExtractor if is_speech_available() else None
def setUp(self):
self.feat_extract_tester = SeamlessM4TFeatureExtractionTester(self)
def test_feat_extract_from_and_save_pretrained(self):
feat_extract_first = self.feature_extraction_class(**self.feat_extract_dict)
with tempfile.TemporaryDirectory() as tmpdirname:
saved_file = feat_extract_first.save_pretrained(tmpdirname)[0]
check_json_file_has_correct_format(saved_file)
feat_extract_second = self.feature_extraction_class.from_pretrained(tmpdirname)
dict_first = feat_extract_first.to_dict()
dict_second = feat_extract_second.to_dict()
self.assertDictEqual(dict_first, dict_second)
def test_feat_extract_to_json_file(self):
feat_extract_first = self.feature_extraction_class(**self.feat_extract_dict)
with tempfile.TemporaryDirectory() as tmpdirname:
json_file_path = os.path.join(tmpdirname, "feat_extract.json")
feat_extract_first.to_json_file(json_file_path)
feat_extract_second = self.feature_extraction_class.from_json_file(json_file_path)
dict_first = feat_extract_first.to_dict()
dict_second = feat_extract_second.to_dict()
self.assertEqual(dict_first, dict_second)
def test_call(self):
# Tests that all call wrap to encode_plus and batch_encode_plus
feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
# create three inputs of length 800, 1000, and 1200
speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
np_speech_inputs = [np.asarray(speech_input) for speech_input in speech_inputs]
# Test feature size
input_features = feature_extractor(np_speech_inputs, padding=True, return_tensors="np").input_features
self.assertTrue(input_features.ndim == 3)
self.assertTrue(input_features.shape[0] == 3)
self.assertTrue(input_features.shape[-1] == feature_extractor.feature_size * feature_extractor.stride)
# Test not batched input
encoded_sequences_1 = feature_extractor(speech_inputs[0], return_tensors="np").input_features
encoded_sequences_2 = feature_extractor(np_speech_inputs[0], return_tensors="np").input_features
self.assertTrue(np.allclose(encoded_sequences_1, encoded_sequences_2, atol=1e-3))
# Test batched
encoded_sequences_1 = feature_extractor(speech_inputs, return_tensors="np").input_features
encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
# Test 2-D numpy arrays are batched.
speech_inputs = [floats_list((1, x))[0] for x in (800, 800, 800)]
np_speech_inputs = np.asarray(speech_inputs)
encoded_sequences_1 = feature_extractor(speech_inputs, return_tensors="np").input_features
encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
@require_torch
# Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTest.test_double_precision_pad
def test_double_precision_pad(self):
import torch
feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
np_speech_inputs = np.random.rand(100, 32).astype(np.float64)
py_speech_inputs = np_speech_inputs.tolist()
for inputs in [py_speech_inputs, np_speech_inputs]:
np_processed = feature_extractor.pad([{"input_features": inputs}], return_tensors="np")
self.assertTrue(np_processed.input_features.dtype == np.float32)
pt_processed = feature_extractor.pad([{"input_features": inputs}], return_tensors="pt")
self.assertTrue(pt_processed.input_features.dtype == torch.float32)
def _load_datasample(self, id):
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
# automatic decoding with librispeech
speech_sample = ds.sort("id")[id]["audio"]["array"]
return torch.from_numpy(speech_sample).unsqueeze(0)
def test_integration(self):
# fmt: off
EXPECTED_INPUT_FEATURES = torch.tensor(
[
-1.5621, -1.4236, -1.3335, -1.3991, -1.2881, -1.1133, -0.9710, -0.8895,
-0.8280, -0.7376, -0.7194, -0.6896, -0.6849, -0.6788, -0.6545, -0.6610,
-0.6566, -0.5738, -0.5252, -0.5533, -0.5887, -0.6116, -0.5971, -0.4956,
-0.2881, -0.1512, 0.0299, 0.1762, 0.2728, 0.2236
]
)
# fmt: on
input_speech = self._load_datasample(10)
feature_extractor = SeamlessM4TFeatureExtractor()
input_features = feature_extractor(input_speech, return_tensors="pt").input_features
feature_extractor(input_speech, return_tensors="pt").input_features[0, 5, :30]
self.assertEqual(input_features.shape, (1, 279, 160))
self.assertTrue(torch.allclose(input_features[0, 5, :30], EXPECTED_INPUT_FEATURES, atol=1e-4))
def test_zero_mean_unit_variance_normalization_trunc_np_longest(self):
feat_extract = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
audio = self._load_datasample(1)
audio = ((audio - audio.min()) / (audio.max() - audio.min())) * 65535 # Rescale to [0, 65535] to show issue
audio = feat_extract.zero_mean_unit_var_norm([audio], attention_mask=None)[0]
self.assertTrue((audio.mean() < 1e-3).all())
self.assertTrue(((audio.var() - 1).abs() < 1e-3).all())
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Testing suite for the PyTorch SeamlessM4T model. """
import copy
import inspect
import tempfile
import unittest
from transformers import SeamlessM4TConfig, is_speech_available, is_torch_available
from transformers.testing_utils import require_torch, slow, torch_device
from transformers.trainer_utils import set_seed
from transformers.utils import cached_property
from ...generation.test_utils import GenerationTesterMixin
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import (
ModelTesterMixin,
_config_zero_init,
floats_tensor,
ids_tensor,
random_attention_mask,
)
if is_torch_available():
import torch
from transformers import (
SeamlessM4TForSpeechToSpeech,
SeamlessM4TForSpeechToText,
SeamlessM4TForTextToSpeech,
SeamlessM4TForTextToText,
SeamlessM4TModel,
)
from transformers.models.seamless_m4t.modeling_seamless_m4t import (
SEAMLESS_M4T_PRETRAINED_MODEL_ARCHIVE_LIST,
)
if is_speech_available():
from transformers import SeamlessM4TProcessor
class SeamlessM4TModelTester:
def __init__(
self,
parent,
input_modality="speech",
batch_size=2,
seq_length=4,
is_training=True,
use_input_mask=True,
use_token_type_ids=True,
use_labels=True,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
initializer_range=0.02,
max_new_tokens=None,
num_labels=3,
num_choices=4,
scope=None,
vocab_size=20,
t2u_vocab_size=20,
hidden_size=6,
num_hidden_layers=2,
intermediate_size=6,
max_position_embeddings=256,
encoder_layers=2,
decoder_layers=2,
encoder_ffn_dim=6,
decoder_ffn_dim=6,
t2u_encoder_layers=2,
t2u_decoder_layers=2,
t2u_encoder_ffn_dim=6,
t2u_decoder_ffn_dim=6,
num_heads=2,
vocoder_num_spkrs=5,
vocoder_num_langs=5,
upsample_initial_channel=32,
unit_embed_dim=25,
spkr_embed_dim=6,
lang_embed_dim=6,
num_conv_pos_embeddings=8,
unit_hifi_gan_vocab_size=20,
t2u_num_langs=0,
t2u_max_new_tokens=25,
t2u_offset_tgt_lang=0,
vocoder_offset=0,
):
self.parent = parent
self.input_modality = input_modality
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.use_input_mask = use_input_mask
self.use_token_type_ids = use_token_type_ids
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.initializer_range = initializer_range
self.num_labels = num_labels
self.num_choices = num_choices
self.scope = scope
self.vocab_size = vocab_size
self.t2u_vocab_size = t2u_vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.intermediate_size = intermediate_size
self.max_position_embeddings = max_position_embeddings
self.encoder_layers = encoder_layers
self.decoder_layers = decoder_layers
self.encoder_ffn_dim = encoder_ffn_dim
self.decoder_ffn_dim = decoder_ffn_dim
self.t2u_encoder_layers = t2u_encoder_layers
self.t2u_decoder_layers = t2u_decoder_layers
self.t2u_encoder_ffn_dim = t2u_encoder_ffn_dim
self.t2u_decoder_ffn_dim = t2u_decoder_ffn_dim
self.num_heads = num_heads
self.num_attention_heads = num_heads
self.vocoder_num_spkrs = vocoder_num_spkrs
self.vocoder_num_langs = vocoder_num_langs
self.upsample_initial_channel = upsample_initial_channel
self.unit_embed_dim = unit_embed_dim
self.spkr_embed_dim = spkr_embed_dim
self.num_conv_pos_embeddings = num_conv_pos_embeddings
self.lang_embed_dim = lang_embed_dim
self.max_new_tokens = max_new_tokens
self.unit_hifi_gan_vocab_size = unit_hifi_gan_vocab_size
self.t2u_num_langs = t2u_num_langs
self.t2u_max_new_tokens = t2u_max_new_tokens
self.t2u_offset_tgt_lang = t2u_offset_tgt_lang
self.vocoder_offset = vocoder_offset
def prepare_config_and_inputs(self):
if self.input_modality == "text":
inputs = ids_tensor([self.batch_size, self.seq_length], self.vocab_size - 1)
else:
inputs = ids_tensor([self.batch_size, self.seq_length, 160], self.vocab_size - 1).float()
input_mask = None
if self.use_input_mask:
input_mask = random_attention_mask([self.batch_size, self.seq_length])
decoder_input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size - 1)
lm_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
config = self.get_config()
return config, inputs, decoder_input_ids, input_mask, lm_labels
def get_config(self):
return SeamlessM4TConfig(
hidden_act=self.hidden_act,
hidden_dropout_prob=self.hidden_dropout_prob,
attention_probs_dropout_prob=self.attention_probs_dropout_prob,
initializer_range=self.initializer_range,
vocab_size=self.vocab_size,
t2u_vocab_size=self.t2u_vocab_size,
hidden_size=self.hidden_size,
speech_encoder_layers=self.num_heads,
speech_encoder_intermediate_size=self.intermediate_size,
max_position_embeddings=self.max_position_embeddings,
encoder_layers=self.encoder_layers,
decoder_layers=self.decoder_layers,
encoder_ffn_dim=self.encoder_ffn_dim,
decoder_ffn_dim=self.decoder_ffn_dim,
t2u_encoder_layers=self.t2u_encoder_layers,
t2u_decoder_layers=self.t2u_decoder_layers,
t2u_encoder_ffn_dim=self.t2u_encoder_ffn_dim,
t2u_decoder_ffn_dim=self.t2u_decoder_ffn_dim,
num_attention_heads=self.num_heads,
encoder_attention_heads=self.num_heads,
decoder_attention_heads=self.num_heads,
t2u_encoder_attention_heads=self.num_heads,
t2u_decoder_attention_heads=self.num_heads,
speech_encoder_attention_heads=self.num_heads,
unit_hifigan_vocab_vise=self.t2u_vocab_size,
vocoder_num_spkrs=self.vocoder_num_spkrs,
vocoder_num_langs=self.vocoder_num_langs,
upsample_initial_channel=self.upsample_initial_channel,
unit_embed_dim=self.unit_embed_dim,
spkr_embed_dim=self.spkr_embed_dim,
num_conv_pos_embeddings=self.num_conv_pos_embeddings,
lang_embed_dim=self.lang_embed_dim,
max_new_tokens=self.max_new_tokens,
unit_hifi_gan_vocab_size=self.unit_hifi_gan_vocab_size,
t2u_num_langs=self.t2u_num_langs,
t2u_max_new_tokens=self.t2u_max_new_tokens,
t2u_offset_tgt_lang=self.t2u_offset_tgt_lang,
vocoder_offset=self.vocoder_offset,
)
def prepare_config_and_inputs_for_decoder(self):
(
config,
input_ids,
decoder_input_ids,
input_mask,
lm_labels,
) = self.prepare_config_and_inputs()
config.is_decoder = True
encoder_hidden_states = floats_tensor([self.batch_size, self.seq_length, self.hidden_size])
encoder_attention_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
return (
config,
input_ids,
decoder_input_ids,
input_mask,
lm_labels,
encoder_hidden_states,
encoder_attention_mask,
)
def create_and_check_model(self, config, input_ids, decoder_input_ids, input_mask, labels):
model = SeamlessM4TModel(config=config)
model.to(torch_device)
model.eval()
if self.input_modality == "text":
result = model(input_ids=input_ids, attention_mask=input_mask, decoder_input_ids=decoder_input_ids)
result = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))
else:
result = model(input_features=input_ids, attention_mask=input_mask, decoder_input_ids=decoder_input_ids)
result = model(input_features=input_ids, decoder_input_ids=decoder_input_ids)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))
decoder_output = result.logits
decoder_past = result.past_key_values
encoder_output = result.encoder_last_hidden_state
if self.input_modality == "text":
seq_length = self.seq_length
else:
# if speech, expected length has been subsampled.
seq_length = model._compute_sub_sample_lengths_from_attention_mask(input_mask).max().item()
self.parent.assertEqual(encoder_output.size(), (self.batch_size, seq_length, self.hidden_size))
self.parent.assertEqual(decoder_output.size(), (self.batch_size, decoder_input_ids.shape[1], self.vocab_size))
# There should be `num_layers` key value embeddings stored in decoder_past
self.parent.assertEqual(len(decoder_past), config.decoder_layers)
# There should be a self attn key, a self attn value, a cross attn key and a cross attn value stored in each decoder_past tuple
self.parent.assertEqual(len(decoder_past[0]), 4)
def create_and_check_decoder_model_past_large_inputs(
self,
config,
input_ids,
decoder_input_ids,
input_mask,
lm_labels,
encoder_hidden_states,
encoder_attention_mask,
):
config.is_decoder = True
model = SeamlessM4TModel(config=config)
model.to(torch_device)
model.eval()
# make sure no pad token in decoder_input_ids
decoder_input_ids = torch.clamp(decoder_input_ids, config.pad_token_id + 1)
# first forward pass
outputs = model(
input_ids, decoder_input_ids=decoder_input_ids, decoder_attention_mask=input_mask, use_cache=True
)
past_key_values = outputs.past_key_values
# create hypothetical multiple next token and extent to next_input_ids
next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
next_mask = ids_tensor((self.batch_size, 3), vocab_size=2)
# append to next input_ids and
next_input_ids = torch.cat([decoder_input_ids, next_tokens], dim=-1)
next_attention_mask = torch.cat([input_mask, next_mask], dim=-1)
output_from_no_past = model(
input_ids,
decoder_input_ids=next_input_ids,
decoder_attention_mask=next_attention_mask,
output_hidden_states=True,
)
output_from_no_past = output_from_no_past["decoder_hidden_states"][0]
output_from_past = model(
input_ids,
decoder_input_ids=next_tokens,
decoder_attention_mask=next_attention_mask,
past_key_values=past_key_values,
output_hidden_states=True,
)["decoder_hidden_states"][0]
# select random slice
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])
# test that outputs are equal for slice
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
(
config,
input_ids,
decoder_input_ids,
input_mask,
lm_labels,
) = config_and_inputs
input_name = "input_ids" if self.input_modality == "text" else "input_features"
inputs_dict = {
input_name: input_ids,
"attention_mask": input_mask,
"decoder_input_ids": decoder_input_ids,
"labels": lm_labels,
}
return config, inputs_dict
@require_torch
class SeamlessM4TModelWithSpeechInputTest(ModelTesterMixin, unittest.TestCase):
is_encoder_decoder = True
fx_compatible = False
test_missing_keys = False
test_pruning = False
test_model_parallel = False
test_resize_embeddings = False
test_headmasking = False
test_torchscript = False
all_model_classes = (
(
SeamlessM4TModel,
SeamlessM4TForSpeechToSpeech,
SeamlessM4TForSpeechToText,
)
if is_torch_available()
else ()
)
all_generative_model_classes = (SeamlessM4TForSpeechToText,) if is_torch_available() else ()
input_name = "input_features"
def setUp(self):
self.model_tester = SeamlessM4TModelTester(self, input_modality="speech")
self.config_tester = ConfigTester(self, config_class=SeamlessM4TConfig)
def test_config(self):
self.config_tester.run_common_tests()
def test_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model(*config_and_inputs)
@slow
def test_model_from_pretrained(self):
for model_name in SEAMLESS_M4T_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = SeamlessM4TModel.from_pretrained(model_name)
self.assertIsNotNone(model)
def _get_input_ids_and_config(self, batch_size=2):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
input_ids = inputs_dict[self.input_name]
# cut to half length & take max batch_size 3
sequence_length = input_ids.shape[-1] // 2
input_ids = input_ids[:batch_size, :sequence_length]
# generate max 3 tokens
max_length = input_ids.shape[-1] + 3
if config.eos_token_id is not None and config.pad_token_id is None:
# hack to allow generate for models such as GPT2 as is done in `generate()`
if isinstance(config.eos_token_id, int):
config.eos_token_id = [config.eos_token_id]
config.pad_token_id = config.eos_token_id[0]
attention_mask = torch.ones(input_ids.shape[:2], dtype=torch.long)[:batch_size, :sequence_length]
return config, input_ids.float(), attention_mask, max_length
@staticmethod
def _get_encoder_outputs(
model, input_ids, attention_mask, output_attentions=None, output_hidden_states=None, num_interleave=1
):
encoder = model.get_encoder()
encoder_outputs = encoder(
input_ids,
attention_mask=attention_mask,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
)
encoder_outputs["last_hidden_state"] = encoder_outputs.last_hidden_state.repeat_interleave(
num_interleave, dim=0
)
input_ids = (
torch.zeros(input_ids.shape[:2], dtype=torch.int64, layout=input_ids.layout, device=input_ids.device)
+ model._get_decoder_start_token_id()
)
attention_mask = None
return encoder_outputs, input_ids, attention_mask
def test_initialization(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
configs_no_init = _config_zero_init(config)
for model_class in self.all_model_classes:
model = model_class(config=configs_no_init)
for name, param in model.named_parameters():
uniform_init_parms = [
"conv.weight",
"masked_spec_embed",
"codevectors",
"quantizer.weight_proj.weight",
"project_hid.weight",
"project_hid.bias",
"project_q.weight",
"project_q.bias",
"pos_bias_v",
"pos_bias_u",
"pointwise_conv1",
"pointwise_conv2",
"feature_projection.projection.weight",
"feature_projection.projection.bias",
"objective.weight",
"adapter",
]
if param.requires_grad:
if any(x in name for x in uniform_init_parms):
self.assertTrue(
-1.0 <= ((param.data.mean() * 1e9).round() / 1e9).item() <= 1.0,
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
)
else:
self.assertIn(
((param.data.mean() * 1e9).round() / 1e9).item(),
[0.0, 1.0],
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
)
@unittest.skip(reason="SeamlessM4TSpeechEncoder doesn't have an embedding layer")
def test_inputs_embeds(self):
pass
@unittest.skip(
reason="Expected missing keys serve when using SeamlessM4TForXXX.from_pretrained from a checkpoint saved by SeamlessM4TModel.save_pretrained."
)
def test_model_weights_reload_no_missing_tied_weights(self):
pass
@unittest.skip(
reason="SeamlessM4TModel is base class but has actually a bigger architecture than seamlessM4T task-specific models."
)
def test_save_load_fast_init_to_base(self):
pass
@unittest.skip(reason="The speech encoder doesn't support head masking")
def test_generate_with_head_masking(self):
pass
@unittest.skip(reason="SeamlessM4TModel can takes input_ids or input_features")
def test_forward_signature(self):
pass
@unittest.skip(reason="SeamlessM4T has no base model")
def test_save_load_fast_init_from_base(self):
pass
def test_attention_outputs(self):
# expected length is subsampled so need to change a bit this test
if not self.has_attentions:
self.skipTest(reason="Model does not output attentions")
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
config.return_dict = True
seq_len = getattr(self.model_tester, "seq_length", None)
decoder_seq_length = getattr(self.model_tester, "decoder_seq_length", seq_len)
encoder_seq_length = getattr(self.model_tester, "encoder_seq_length", seq_len)
decoder_key_length = getattr(self.model_tester, "decoder_key_length", decoder_seq_length)
encoder_key_length = getattr(self.model_tester, "key_length", encoder_seq_length)
# no more chunk_length test
for model_class in self.all_model_classes:
inputs_dict["output_attentions"] = True
inputs_dict["output_hidden_states"] = False
config.return_dict = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
attentions = outputs.encoder_attentions if config.is_encoder_decoder else outputs.attentions
self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)
# check that output_attentions also work using config
del inputs_dict["output_attentions"]
config.output_attentions = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
attentions = outputs.encoder_attentions if config.is_encoder_decoder else outputs.attentions
self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)
self.assertListEqual(
list(attentions[0].shape[-3:]),
[self.model_tester.num_attention_heads, encoder_seq_length, encoder_key_length],
)
out_len = len(outputs)
if self.is_encoder_decoder:
correct_outlen = 5
# loss is at first position
if "labels" in inputs_dict:
correct_outlen += 1 # loss is added to beginning
if "past_key_values" in outputs:
correct_outlen += 1 # past_key_values have been returned
self.assertEqual(out_len, correct_outlen)
# decoder attentions
decoder_attentions = outputs.decoder_attentions
self.assertIsInstance(decoder_attentions, (list, tuple))
self.assertEqual(len(decoder_attentions), self.model_tester.num_hidden_layers)
self.assertListEqual(
list(decoder_attentions[0].shape[-3:]),
[self.model_tester.num_attention_heads, decoder_seq_length, decoder_key_length],
)
# cross attentions
cross_attentions = outputs.cross_attentions
self.assertIsInstance(cross_attentions, (list, tuple))
self.assertEqual(len(cross_attentions), self.model_tester.num_hidden_layers)
sub_sampled_length = (
model._compute_sub_sample_lengths_from_attention_mask(inputs_dict["attention_mask"]).max().item()
)
self.assertListEqual(
list(cross_attentions[0].shape[-3:]),
[
self.model_tester.num_attention_heads,
decoder_seq_length,
sub_sampled_length,
],
)
# Check attention is always last and order is fine
inputs_dict["output_attentions"] = True
inputs_dict["output_hidden_states"] = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
if hasattr(self.model_tester, "num_hidden_states_types"):
added_hidden_states = self.model_tester.num_hidden_states_types
elif self.is_encoder_decoder:
added_hidden_states = 2
else:
added_hidden_states = 1
self.assertEqual(out_len + added_hidden_states, len(outputs))
self_attentions = outputs.encoder_attentions if config.is_encoder_decoder else outputs.attentions
self.assertEqual(len(self_attentions), self.model_tester.num_hidden_layers)
self.assertListEqual(
list(self_attentions[0].shape[-3:]),
[self.model_tester.num_attention_heads, encoder_seq_length, encoder_key_length],
)
@require_torch
class SeamlessM4TModelWithTextInputTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
is_encoder_decoder = True
fx_compatible = False
test_missing_keys = False
test_pruning = False
test_model_parallel = False
test_resize_embeddings = True
test_headmasking = False
test_torchscript = False
all_model_classes = (
(
SeamlessM4TModel,
SeamlessM4TForTextToSpeech,
SeamlessM4TForTextToText,
)
if is_torch_available()
else ()
)
all_generative_model_classes = (SeamlessM4TForTextToText,) if is_torch_available() else ()
def setUp(self):
self.model_tester = SeamlessM4TModelTester(self, input_modality="text")
self.config_tester = ConfigTester(self, config_class=SeamlessM4TConfig)
def test_config(self):
self.config_tester.run_common_tests()
def test_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model(*config_and_inputs)
@slow
def test_model_from_pretrained(self):
for model_name in SEAMLESS_M4T_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = SeamlessM4TModel.from_pretrained(model_name)
self.assertIsNotNone(model)
def test_initialization(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
configs_no_init = _config_zero_init(config)
for model_class in self.all_model_classes:
model = model_class(config=configs_no_init)
for name, param in model.named_parameters():
uniform_init_parms = [
"conv.weight",
"masked_spec_embed",
"codevectors",
"quantizer.weight_proj.weight",
"project_hid.weight",
"project_hid.bias",
"project_q.weight",
"project_q.bias",
"pos_bias_v",
"pos_bias_u",
"pointwise_conv1",
"pointwise_conv2",
"feature_projection.projection.weight",
"feature_projection.projection.bias",
"objective.weight",
"adapter",
]
if param.requires_grad:
if any(x in name for x in uniform_init_parms):
self.assertTrue(
-1.0 <= ((param.data.mean() * 1e9).round() / 1e9).item() <= 1.0,
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
)
else:
self.assertIn(
((param.data.mean() * 1e9).round() / 1e9).item(),
[0.0, 1.0],
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
)
@unittest.skip(
reason="Expected missing keys serve when using SeamlessM4TForXXX.from_pretrained from a checkpoint saved by SeamlessM4TModel.save_pretrained."
)
def test_model_weights_reload_no_missing_tied_weights(self):
pass
def test_generate_with_head_masking(self):
"""Test designed for encoder-decoder models to ensure the attention head masking is used."""
attention_names = ["encoder_attentions", "decoder_attentions", "cross_attentions"]
for model_class in self.all_generative_model_classes:
config, input_ids, attention_mask, max_length = self._get_input_ids_and_config()
model = model_class(config).to(torch_device).eval()
head_masking = {
"head_mask": torch.zeros(config.encoder_layers, config.encoder_attention_heads, device=torch_device),
"decoder_head_mask": torch.zeros(
config.decoder_layers, config.decoder_attention_heads, device=torch_device
),
"cross_attn_head_mask": torch.zeros(
config.decoder_layers, config.decoder_attention_heads, device=torch_device
),
}
signature = inspect.signature(model.forward)
# We want to test only models where encoder/decoder head masking is implemented
if not set(head_masking.keys()) < {*signature.parameters.keys()}:
continue
for attn_name, (name, mask) in zip(attention_names, head_masking.items()):
out = model.generate(
input_ids,
attention_mask=attention_mask,
num_beams=1,
output_attentions=True,
return_dict_in_generate=True,
remove_invalid_values=True,
**{name: mask},
)
# We check the state of decoder_attentions and cross_attentions just from the last step
attn_weights = out[attn_name] if attn_name == attention_names[0] else out[attn_name][-1]
self.assertEqual(sum([w.sum().item() for w in attn_weights]), 0.0)
@unittest.skip(reason="SeamlessM4TModel can take input_ids or input_features")
def test_forward_signature(self):
pass
def test_decoder_model_past_with_large_inputs(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs_for_decoder()
self.model_tester.create_and_check_decoder_model_past_large_inputs(*config_and_inputs)
@unittest.skip(
reason="SeamlessM4TModel is base class but has actually a bigger architecture than seamlessM4T task-specific models."
)
def test_save_load_fast_init_to_base(self):
pass
@unittest.skip(reason="SeamlessM4T has no base model")
def test_save_load_fast_init_from_base(self):
pass
@require_torch
class SeamlessM4TGenerationTest(unittest.TestCase):
# test that non-standard generation works
# test generation of: SeamlessM4TModel, SeamlessM4TForSpeechToSpeech, SeamlessM4TForSpeechToText, SeamlessM4TForTextToSpeech
def setUp(self):
self.speech_model_tester = SeamlessM4TModelTester(self, input_modality="speech")
self.text_model_tester = SeamlessM4TModelTester(self, input_modality="text")
self.tmpdirname = tempfile.mkdtemp()
def update_generation(self, model):
lang_code_to_id = {
"fra": 4,
"eng": 4,
}
generation_config = copy.deepcopy(model.generation_config)
generation_config.__setattr__("text_decoder_lang_to_code_id", lang_code_to_id)
generation_config.__setattr__("t2u_lang_code_to_id", lang_code_to_id)
generation_config.__setattr__("vocoder_lang_code_to_id", lang_code_to_id)
generation_config._from_model_config = False
model.generation_config = generation_config
def prepare_text_input(self):
config, inputs, decoder_input_ids, input_mask, lm_labels = self.text_model_tester.prepare_config_and_inputs()
input_dict = {
"input_ids": inputs,
"attention_mask": input_mask,
"tgt_lang": "eng",
"num_beams": 2,
"do_sample": True,
}
return config, input_dict
def prepare_speech_input(self):
config, inputs, decoder_input_ids, input_mask, lm_labels = self.speech_model_tester.prepare_config_and_inputs()
input_dict = {
"input_features": inputs,
"attention_mask": input_mask,
"tgt_lang": "fra",
"num_beams": 2,
"do_sample": True,
}
return config, input_dict
def prepare_speech_and_text_input(self):
config, inputs, decoder_input_ids, input_mask, lm_labels = self.speech_model_tester.prepare_config_and_inputs()
input_speech = {
"input_features": inputs,
"attention_mask": input_mask,
"tgt_lang": "fra",
"num_beams": 2,
"do_sample": True,
}
config, inputs, decoder_input_ids, input_mask, lm_labels = self.text_model_tester.prepare_config_and_inputs()
input_text = {
"input_ids": inputs,
"attention_mask": input_mask,
"tgt_lang": "eng",
"num_beams": 2,
"do_sample": True,
}
return config, input_speech, input_text
def factory_generation_speech_test(self, model, inputs):
set_seed(0)
output = model.generate(**inputs)
return output
def test_speech_generation(self):
config, input_speech, input_text = self.prepare_speech_and_text_input()
model = SeamlessM4TModel(config=config)
self.update_generation(model)
model.save_pretrained(self.tmpdirname)
model.to(torch_device)
model.eval()
output_original_text = self.factory_generation_speech_test(model, input_text)
output_original_speech = self.factory_generation_speech_test(model, input_speech)
model = SeamlessM4TForTextToSpeech.from_pretrained(self.tmpdirname)
self.update_generation(model)
model.to(torch_device)
model.eval()
output_text = self.factory_generation_speech_test(model, input_text)
model = SeamlessM4TForSpeechToSpeech.from_pretrained(self.tmpdirname)
self.update_generation(model)
model.to(torch_device)
model.eval()
output_speech = self.factory_generation_speech_test(model, input_speech)
# test same text output from input text
self.assertListEqual(output_original_text[0].ravel().tolist(), output_text[0].ravel().tolist())
self.assertListEqual(output_original_text[1].ravel().tolist(), output_text[1].ravel().tolist())
# test same speech output from input text
self.assertListEqual(output_original_speech[0].ravel().tolist(), output_speech[0].ravel().tolist())
self.assertListEqual(output_original_speech[1].ravel().tolist(), output_speech[1].ravel().tolist())
def test_text_generation(self):
config, input_speech, input_text = self.prepare_speech_and_text_input()
# to return speech
input_speech["generate_speech"] = False
input_text["generate_speech"] = False
model = SeamlessM4TModel(config=config)
self.update_generation(model)
model.save_pretrained(self.tmpdirname)
model.to(torch_device)
model.eval()
output_original_text = self.factory_generation_speech_test(model, input_text)
output_original_speech = self.factory_generation_speech_test(model, input_speech)
# other models don't need it
input_speech.pop("generate_speech")
input_text.pop("generate_speech")
model = SeamlessM4TForTextToText.from_pretrained(self.tmpdirname)
self.update_generation(model)
model.to(torch_device)
model.eval()
output_text = self.factory_generation_speech_test(model, input_text)
model = SeamlessM4TForSpeechToText.from_pretrained(self.tmpdirname)
self.update_generation(model)
model.to(torch_device)
model.eval()
output_speech = self.factory_generation_speech_test(model, input_speech)
# test same text output from input text
self.assertListEqual(output_original_text[0].ravel().tolist(), output_text.ravel().tolist())
# test same speech output from input text
self.assertListEqual(output_original_speech[0].ravel().tolist(), output_speech.ravel().tolist())
def test_generation(self):
config, input_speech, input_text = self.prepare_speech_and_text_input()
input_speech["num_beams"] = 3
input_speech["do_sample"] = True
input_speech["num_return_sequences"] = 3
input_text["num_beams"] = 3
input_text["do_sample"] = True
input_text["num_return_sequences"] = 3
for model_class in [SeamlessM4TForSpeechToSpeech, SeamlessM4TForSpeechToText, SeamlessM4TModel]:
model = model_class(config=config)
self.update_generation(model)
model.to(torch_device)
model.eval()
output = model.generate(**input_speech)
output = output[0] if isinstance(output, tuple) else output
self.assertEqual(output.shape[0], 3 * input_speech["input_features"].shape[0])
for model_class in [SeamlessM4TForTextToSpeech, SeamlessM4TForTextToText, SeamlessM4TModel]:
model = model_class(config=config)
self.update_generation(model)
model.to(torch_device)
model.eval()
output = model.generate(**input_text)
output = output[0] if isinstance(output, tuple) else output
self.assertEqual(output.shape[0], 3 * input_text["input_ids"].shape[0])
@require_torch
class SeamlessM4TModelIntegrationTest(unittest.TestCase):
repo_id = "facebook/hf-seamless-m4t-medium"
def assertListAlmostEqual(self, list1, list2, tol=1e-3):
self.assertEqual(len(list1), len(list2))
for a, b in zip(list1, list2):
self.assertAlmostEqual(a, b, delta=tol)
@cached_property
def processor(self):
return SeamlessM4TProcessor.from_pretrained(self.repo_id)
@cached_property
def input_text(self):
# corresponds to "C'est un test." with seamlessM4T_medium checkpoint
# fmt: off
input_ids = torch.tensor([[256057, 152, 248116, 354, 159, 7356, 248075, 3]])
# fmt: on
input_ids = input_ids.to(torch_device)
attention_mask = torch.ones_like(input_ids).to(torch_device)
inputs = {
"attention_mask": attention_mask,
"input_ids": input_ids,
}
return inputs
@cached_property
def input_audio(self):
set_seed(0)
seq_len = 20000
sampling_rate = 16000
input_features = torch.rand((2, seq_len))
return self.processor(audios=[input_features.tolist()], sampling_rate=sampling_rate, return_tensors="pt").to(
torch_device
)
def factory_test_task(self, class1, class2, inputs, class1_kwargs, class2_kwargs):
model1 = class1.from_pretrained(self.repo_id).to(torch_device)
model2 = class2.from_pretrained(self.repo_id).to(torch_device)
set_seed(0)
output_1 = model1.generate(**inputs, **class1_kwargs)
set_seed(0)
output_2 = model2.generate(**inputs, **class2_kwargs)
for key in output_1:
if isinstance(output_1[key], torch.Tensor):
if len(output_1[key].shape) == 0:
self.assertEqual(output_1[key].item(), output_2[key].item())
else:
self.assertListAlmostEqual(output_1[key].squeeze().tolist(), output_2[key].squeeze().tolist())
@slow
def test_to_eng_text(self):
model = SeamlessM4TModel.from_pretrained(self.repo_id).to(torch_device)
# test text - tgt lang: eng
# fmt: off
expected_text_tokens = [3, 256047, 3291, 248116, 248066, 9, 7356, 248075, 3]
# fmt: on
# fmt: off
expected_unit_tokens = [
2,10051,8980,8212,949,1270,4311,1123,5918,2333,5311,3882,2415,5284,1123,612,8816,6370,5386,7334,4345,5645,
9437,5748,1378,9818,4319,7968,7375,2909,9119,5151,8728,5335,3896,4013,8939,8885,6048,9530,3167,5833,1072,693,
431,9867,364,7909,4608,5938,1889,9984,7947,4944,6171,3767,9861,9169,1187,8365,4571,7635,7784,7635,800,2393,
32,5380,5852,8289,2530,2762,1833,2056,3553,4641,3553,5683,370,2288,1344,1518,7534,703,8359,7699,2
]
# fmt: on
# fmt: off
expected_wav_slice = [-3e-05, -0.0004, -0.00037, -0.00013, -6e-05, 0.00012, -0.00016, 0.00025, 7e-05, -3e-05]
# fmt: on
set_seed(0)
output = model.generate(**self.input_text, num_beams=1, tgt_lang="eng", return_intermediate_token_ids=True)
self.assertListEqual(expected_text_tokens, output.sequences.squeeze().tolist())
# FOR NOW, only first units correspondance
self.assertListEqual(expected_unit_tokens[:10], output.unit_sequences.squeeze().tolist()[:10])
self.assertListAlmostEqual(expected_wav_slice, output.waveform.squeeze().tolist()[50:60])
@slow
def test_to_swh_text(self):
model = SeamlessM4TModel.from_pretrained(self.repo_id).to(torch_device)
# test text - tgt lang: swh
# fmt: off
expected_text_tokens = [3, 256168, 1665, 188589, 7040, 248075, 3]
# fmt: on
# fmt: off
expected_unit_tokens = [
2,10071,5729,9995,3089,7546,1204,1721,2532,4340,5623,3496,432,7730,9096,7677,3143,8211,6447,8399,4248,3565,
4529,7700,9308,217,6476,3485,9667,3194,8476,4923,5593,1148,4466,7416,4872,463,4872,253,2348,4640,3450,2133,
6318,2806,817,7613,2698,6563,8712,8344,9286,6878,6387,4281,6387,640,6387,3200,640,8355,640,6708,979,1738,2
]
# fmt: on
# fmt: off
expected_wav_slice = [1e-05, -7e-05, -4e-05, -4e-05, -6e-05, -9e-05, -0.0001, -2e-05, -7e-05, -2e-05]
# fmt: on
set_seed(0)
output = model.generate(**self.input_text, num_beams=1, tgt_lang="swh", return_intermediate_token_ids=True)
self.assertListEqual(expected_text_tokens, output.sequences.squeeze().tolist())
self.assertListEqual(expected_unit_tokens[:10], output.unit_sequences.squeeze().tolist()[:10])
self.assertListAlmostEqual(expected_wav_slice, output.waveform.squeeze().tolist()[50:60])
@slow
def test_to_rus_speech(self):
model = SeamlessM4TModel.from_pretrained(self.repo_id).to(torch_device)
# test audio - tgt lang: rus
# fmt: off
expected_text_tokens = [3, 256147, 1197, 73565, 3413, 537, 233331, 248075, 3]
# fmt: on
# fmt: off
expected_unit_tokens = [
2, 10067, 5729, 4798, 9631, 8378, 4446, 2393, 6901, 5983, 2817, 4629, 8532, 1991, 2931, 8576, 8857, 5936, 4317,
9000, 7740, 7995, 1225, 5980, 6094, 1420, 5373, 8771, 6600, 4487, 7029, 3630, 6740, 4870, 1483, 3003, 5585, 5511,
7465, 3222, 32, 6272, 1950, 3120, 5368, 639, 3713, 5935, 7943, 567, 6129, 6822, 1226, 5063, 9878, 7756, 8825, 1078, 5943,
457, 9282, 9668, 817, 7613, 2698, 6563, 8712, 8704, 9286, 8704, 6387, 4281, 6387, 640, 3200, 6387, 640, 8355, 6708, 979, 1738, 2
]
# fmt: on
# fmt: off
expected_wav_slice = [0.00013, 0.00012, 0.00014, 3e-05, 0.0, -6e-05, -0.00018, -0.00016, -0.00021, -0.00018]
# fmt: on
set_seed(0)
output = model.generate(**self.input_audio, num_beams=1, tgt_lang="rus", return_intermediate_token_ids=True)
self.assertListEqual(expected_text_tokens, output.sequences.squeeze().tolist())
self.assertListEqual(expected_unit_tokens[:10], output.unit_sequences.squeeze().tolist()[:10])
self.assertListAlmostEqual(expected_wav_slice, output.waveform.squeeze().tolist()[50:60])
@slow
def test_text_to_text_model(self):
kwargs1 = {"tgt_lang": "eng", "return_intermediate_token_ids": True, "generate_speech": False}
kwargs2 = {
"tgt_lang": "eng",
"output_hidden_states": True,
"return_dict_in_generate": True,
"output_scores": True,
}
self.factory_test_task(SeamlessM4TModel, SeamlessM4TForTextToText, self.input_text, kwargs1, kwargs2)
@slow
def test_speech_to_text_model(self):
kwargs1 = {"tgt_lang": "eng", "return_intermediate_token_ids": True, "generate_speech": False}
kwargs2 = {
"tgt_lang": "eng",
"output_hidden_states": True,
"return_dict_in_generate": True,
"output_scores": True,
}
self.factory_test_task(SeamlessM4TModel, SeamlessM4TForSpeechToText, self.input_audio, kwargs1, kwargs2)
@slow
def test_speech_to_speech_model(self):
kwargs1 = {"tgt_lang": "eng", "return_intermediate_token_ids": True}
self.factory_test_task(SeamlessM4TModel, SeamlessM4TForSpeechToSpeech, self.input_audio, kwargs1, kwargs1)
@slow
def test_text_to_speech_model(self):
kwargs1 = {"tgt_lang": "eng", "return_intermediate_token_ids": True}
self.factory_test_task(SeamlessM4TModel, SeamlessM4TForTextToSpeech, self.input_text, kwargs1, kwargs1)
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import shutil
import tempfile
import unittest
from transformers import SeamlessM4TFeatureExtractor, SeamlessM4TProcessor
from transformers.models.seamless_m4t import (
SeamlessM4TTokenizer,
SeamlessM4TTokenizerFast,
)
from transformers.testing_utils import require_torch
from .test_feature_extraction_seamless_m4t import floats_list
@require_torch
class SeamlessM4TProcessorTest(unittest.TestCase):
def setUp(self):
self.checkpoint = "facebook/hf-seamless-m4t-medium"
self.tmpdirname = tempfile.mkdtemp()
def get_tokenizer(self, **kwargs):
return SeamlessM4TTokenizer.from_pretrained(self.checkpoint, **kwargs)
def get_feature_extractor(self, **kwargs):
return SeamlessM4TFeatureExtractor.from_pretrained(self.checkpoint, **kwargs)
def tearDown(self):
shutil.rmtree(self.tmpdirname)
def test_save_load_pretrained_default(self):
tokenizer = self.get_tokenizer()
feature_extractor = self.get_feature_extractor()
processor = SeamlessM4TProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)
processor.save_pretrained(self.tmpdirname)
processor = SeamlessM4TProcessor.from_pretrained(self.tmpdirname)
self.assertEqual(processor.tokenizer.get_vocab(), tokenizer.get_vocab())
tokenizer_instance = isinstance(processor.tokenizer, SeamlessM4TTokenizerFast) or isinstance(
processor.tokenizer, SeamlessM4TTokenizer
)
self.assertTrue(tokenizer_instance)
self.assertEqual(processor.feature_extractor.to_json_string(), feature_extractor.to_json_string())
self.assertIsInstance(processor.feature_extractor, SeamlessM4TFeatureExtractor)
def test_save_load_pretrained_additional_features(self):
processor = SeamlessM4TProcessor(
tokenizer=self.get_tokenizer(), feature_extractor=self.get_feature_extractor()
)
processor.save_pretrained(self.tmpdirname)
tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)")
feature_extractor_add_kwargs = self.get_feature_extractor(do_normalize=False, padding_value=1.0)
processor = SeamlessM4TProcessor.from_pretrained(
self.tmpdirname, bos_token="(BOS)", eos_token="(EOS)", do_normalize=False, padding_value=1.0
)
self.assertEqual(processor.feature_extractor.to_json_string(), feature_extractor_add_kwargs.to_json_string())
self.assertIsInstance(processor.feature_extractor, SeamlessM4TFeatureExtractor)
self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())
tokenizer_instance = isinstance(processor.tokenizer, SeamlessM4TTokenizerFast) or isinstance(
processor.tokenizer, SeamlessM4TTokenizer
)
self.assertTrue(tokenizer_instance)
# Copied from test.models.whisper.test_processor_whisper.WhisperProcessorTest.test_feature_extractor with Whisper->SeamlessM4T
def test_feature_extractor(self):
feature_extractor = self.get_feature_extractor()
tokenizer = self.get_tokenizer()
processor = SeamlessM4TProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)
raw_speech = floats_list((3, 1000))
input_feat_extract = feature_extractor(raw_speech, return_tensors="np")
input_processor = processor(audios=raw_speech, return_tensors="np")
for key in input_feat_extract.keys():
self.assertAlmostEqual(input_feat_extract[key].sum(), input_processor[key].sum(), delta=1e-2)
# Copied from test.models.whisper.test_processor_whisper.WhisperProcessorTest.test_tokenizer with Whisper->SeamlessM4T
def test_tokenizer(self):
feature_extractor = self.get_feature_extractor()
tokenizer = self.get_tokenizer()
processor = SeamlessM4TProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)
input_str = "This is a test string"
encoded_processor = processor(text=input_str)
encoded_tok = tokenizer(input_str)
for key in encoded_tok.keys():
self.assertListEqual(encoded_tok[key], encoded_processor[key])
# Copied from test.models.whisper.test_processor_whisper.WhisperProcessorTest.test_tokenizer_decode with Whisper->SeamlessM4T
def test_tokenizer_decode(self):
feature_extractor = self.get_feature_extractor()
tokenizer = self.get_tokenizer()
processor = SeamlessM4TProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)
predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9]]
decoded_processor = processor.batch_decode(predicted_ids)
decoded_tok = tokenizer.batch_decode(predicted_ids)
self.assertListEqual(decoded_tok, decoded_processor)
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import tempfile
import unittest
from transformers import (
SPIECE_UNDERLINE,
AddedToken,
BatchEncoding,
PreTrainedTokenizerFast,
SeamlessM4TTokenizer,
SeamlessM4TTokenizerFast,
is_torch_available,
)
from transformers.testing_utils import (
get_tests_dir,
nested_simplify,
require_sentencepiece,
require_tokenizers,
require_torch,
)
from ...test_tokenization_common import TokenizerTesterMixin
SAMPLE_VOCAB = get_tests_dir("fixtures/test_sentencepiece.model")
if is_torch_available():
from transformers.models.m2m_100.modeling_m2m_100 import shift_tokens_right
EN_CODE = 256047
RO_CODE = 256145
SMALL_TRAINING_CORPUS = [
["This is the first sentence.", "This is the second one."],
["This sentence (contains #) over symbols and numbers 12 3.", "But not this one."],
]
@require_sentencepiece
@require_tokenizers
class SeamlessM4TTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenizer_class = SeamlessM4TTokenizer
rust_tokenizer_class = SeamlessM4TTokenizerFast
test_rust_tokenizer = True
test_sentencepiece = True
from_pretrained_kwargs = {}
def setUp(self):
super().setUp()
# We have a SentencePiece fixture for testing
tokenizer = SeamlessM4TTokenizer(SAMPLE_VOCAB, keep_accents=True)
tokenizer.save_pretrained(self.tmpdirname)
def test_full_tokenizer(self):
tokenizer = SeamlessM4TTokenizer(SAMPLE_VOCAB, keep_accents=True)
tokens = tokenizer.tokenize("This is a test")
self.assertListEqual(tokens, ["▁This", "▁is", "▁a", "▁t", "est"])
self.assertListEqual(
tokenizer.convert_tokens_to_ids(tokens),
[value + tokenizer.fairseq_offset for value in [285, 46, 10, 170, 382]],
)
tokens = tokenizer.tokenize("I was born in 92000, and this is falsé.")
self.assertListEqual(
tokens,
[
SPIECE_UNDERLINE + "I",
SPIECE_UNDERLINE + "was",
SPIECE_UNDERLINE + "b",
"or",
"n",
SPIECE_UNDERLINE + "in",
SPIECE_UNDERLINE + "",
"9",
"2",
"0",
"0",
"0",
",",
SPIECE_UNDERLINE + "and",
SPIECE_UNDERLINE + "this",
SPIECE_UNDERLINE + "is",
SPIECE_UNDERLINE + "f",
"al",
"s",
"é",
".",
],
)
ids = tokenizer.convert_tokens_to_ids(tokens)
self.assertListEqual(
ids,
[
value + tokenizer.fairseq_offset
for value in [8, 21, 84, 55, 24, 19, 7, 0, 602, 347, 347, 347, 3, 12, 66, 46, 72, 80, 6, 0, 4]
],
)
back_tokens = tokenizer.convert_ids_to_tokens(ids)
self.assertListEqual(
back_tokens,
[
SPIECE_UNDERLINE + "I",
SPIECE_UNDERLINE + "was",
SPIECE_UNDERLINE + "b",
"or",
"n",
SPIECE_UNDERLINE + "in",
SPIECE_UNDERLINE + "",
"<unk>",
"2",
"0",
"0",
"0",
",",
SPIECE_UNDERLINE + "and",
SPIECE_UNDERLINE + "this",
SPIECE_UNDERLINE + "is",
SPIECE_UNDERLINE + "f",
"al",
"s",
"<unk>",
".",
],
)
def test_maximum_encoding_length_single_input(self):
tokenizers = self.get_tokenizers(do_lower_case=False, model_max_length=100)
for tokenizer in tokenizers:
with self.subTest(f"{tokenizer.__class__.__name__}"):
seq_0, ids = self.get_clean_sequence(tokenizer, max_length=20)
sequence = tokenizer.encode(seq_0, add_special_tokens=False)
total_length = len(sequence)
self.assertGreater(
total_length, 4, "Issue with the testing sequence, please update it, it's too short"
)
# Test with max model input length
model_max_length = tokenizer.model_max_length
self.assertEqual(model_max_length, 100)
seq_1 = seq_0 * model_max_length
sequence1 = tokenizer(seq_1, add_special_tokens=False)
total_length1 = len(sequence1["input_ids"])
self.assertGreater(
total_length1,
model_max_length,
"Issue with the testing sequence, please update it, it's too short",
)
# Simple
padding_strategies = (
[False, True, "longest"] if tokenizer.pad_token and tokenizer.pad_token_id >= 0 else [False]
)
for padding_state in padding_strategies:
with self.subTest(f"Padding: {padding_state}"):
for truncation_state in [True, "longest_first", "only_first"]:
with self.subTest(f"Truncation: {truncation_state}"):
output = tokenizer(seq_1, padding=padding_state, truncation=truncation_state)
self.assertEqual(len(output["input_ids"]), model_max_length)
output = tokenizer([seq_1], padding=padding_state, truncation=truncation_state)
self.assertEqual(len(output["input_ids"][0]), model_max_length)
# Simple with no truncation
# Reset warnings
tokenizer.deprecation_warnings = {}
with self.assertLogs("transformers", level="WARNING") as cm:
output = tokenizer(seq_1, padding=padding_state, truncation=False)
self.assertNotEqual(len(output["input_ids"]), model_max_length)
self.assertEqual(len(cm.records), 1)
self.assertTrue(
cm.records[0].message.startswith(
"Token indices sequence length is longer than the specified maximum sequence length"
" for this model"
)
)
tokenizer.deprecation_warnings = {}
with self.assertLogs("transformers", level="WARNING") as cm:
output = tokenizer([seq_1], padding=padding_state, truncation=False)
self.assertNotEqual(len(output["input_ids"][0]), model_max_length)
self.assertEqual(len(cm.records), 1)
self.assertTrue(
cm.records[0].message.startswith(
"Token indices sequence length is longer than the specified maximum sequence length"
" for this model"
)
)
# Overflowing tokens
stride = 2
# modify padding because it's activated by default in seamlessM4T
information = tokenizer(
seq_0,
max_length=total_length - 2,
add_special_tokens=False,
stride=stride,
truncation="longest_first",
return_overflowing_tokens=True,
padding=False,
# add_prefix_space=False,
)
# Overflowing tokens are handled quite differently in slow and fast tokenizers
if isinstance(tokenizer, PreTrainedTokenizerFast):
truncated_sequence = information["input_ids"][0]
overflowing_tokens = information["input_ids"][1]
self.assertEqual(len(information["input_ids"]), 2)
self.assertEqual(len(truncated_sequence), total_length - 2)
self.assertEqual(truncated_sequence, sequence[:-2])
self.assertEqual(len(overflowing_tokens), 2 + stride)
self.assertEqual(overflowing_tokens, sequence[-(2 + stride) :])
else:
truncated_sequence = information["input_ids"]
overflowing_tokens = information["overflowing_tokens"]
self.assertEqual(len(truncated_sequence), total_length - 2)
self.assertEqual(truncated_sequence, sequence[:-2])
self.assertEqual(len(overflowing_tokens), 2 + stride)
self.assertEqual(overflowing_tokens, sequence[-(2 + stride) :])
@unittest.skip("By defaults, uses pad_to_multiple_of which breaks the test")
def test_maximum_encoding_length_pair_input(self):
pass
def test_padding_to_multiple_of(self):
tokenizers = self.get_tokenizers()
for tokenizer in tokenizers:
with self.subTest(f"{tokenizer.__class__.__name__}"):
if tokenizer.pad_token is None:
self.skipTest("No padding token.")
else:
empty_tokens = tokenizer("", padding=True, pad_to_multiple_of=8)
normal_tokens = tokenizer("This is a sample input", padding=True, pad_to_multiple_of=8)
for key, value in empty_tokens.items():
self.assertEqual(len(value) % 8, 0, f"BatchEncoding.{key} is not multiple of 8")
for key, value in normal_tokens.items():
self.assertEqual(len(value) % 8, 0, f"BatchEncoding.{key} is not multiple of 8")
# default to padding=True so need to precise which padding is called
normal_tokens = tokenizer("This", pad_to_multiple_of=8, padding=False)
for key, value in normal_tokens.items():
self.assertNotEqual(len(value) % 8, 0, f"BatchEncoding.{key} is not multiple of 8")
# Should also work with truncation
normal_tokens = tokenizer("This", padding=True, truncation=True, pad_to_multiple_of=8)
for key, value in normal_tokens.items():
self.assertEqual(len(value) % 8, 0, f"BatchEncoding.{key} is not multiple of 8")
# truncation to something which is not a multiple of pad_to_multiple_of raises an error
self.assertRaises(
ValueError,
tokenizer.__call__,
"This",
padding=True,
truncation=True,
max_length=12,
pad_to_multiple_of=8,
)
@require_torch
def test_prepare_seq2seq_batch(self):
if not self.test_seq2seq:
return
tokenizers = self.get_tokenizers()
for tokenizer in tokenizers:
with self.subTest(f"{tokenizer.__class__.__name__}"):
# Longer text that will definitely require truncation.
src_text = [
" UN Chief Says There Is No Military Solution in Syria",
" Secretary-General Ban Ki-moon says his response to Russia's stepped up military support for"
" Syria is that 'there is no military solution' to the nearly five-year conflict and more weapons"
" will only worsen the violence and misery for millions of people.",
]
tgt_text = [
"Şeful ONU declară că nu există o soluţie militară în Siria",
"Secretarul General Ban Ki-moon declară că răspunsul său la intensificarea sprijinului militar al"
' Rusiei pentru Siria este că "nu există o soluţie militară" la conflictul de aproape cinci ani şi'
" că noi arme nu vor face decât să înrăutăţească violenţele şi mizeria pentru milioane de oameni.",
]
try:
batch = tokenizer.prepare_seq2seq_batch(
src_texts=src_text,
tgt_texts=tgt_text,
max_length=3,
max_target_length=10,
return_tensors="pt",
src_lang="eng",
tgt_lang="ron",
pad_to_multiple_of=None,
)
except NotImplementedError:
return
self.assertEqual(batch.input_ids.shape[1], 3)
self.assertEqual(batch.labels.shape[1], 10)
# TODO: not working for tgt_text
# max_target_length will default to max_length if not specified
batch = tokenizer.prepare_seq2seq_batch(
src_texts=src_text,
tgt_texts=tgt_text,
max_length=4,
return_tensors="pt",
pad_to_multiple_of=None,
)
self.assertEqual(batch.input_ids.shape[1], 4)
self.assertEqual(batch.labels.shape[1], 4)
batch_encoder_only = tokenizer.prepare_seq2seq_batch(
src_texts=src_text,
max_length=4,
max_target_length=10,
return_tensors="pt",
pad_to_multiple_of=None,
)
self.assertEqual(batch_encoder_only.input_ids.shape[1], 4)
self.assertEqual(batch_encoder_only.attention_mask.shape[1], 4)
self.assertNotIn("decoder_input_ids", batch_encoder_only)
@unittest.skip("Unfortunately way too slow to build a BPE with SentencePiece.")
def test_save_slow_from_fast_and_reload_fast(self):
pass
# Copied from tests.models.nllb.test_tokenization_nllb.NllbTokenizationTest.test_special_tokens_initialization
def test_special_tokens_initialization(self):
for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
added_tokens = [AddedToken("<special>", lstrip=True)]
tokenizer_r = self.rust_tokenizer_class.from_pretrained(
pretrained_name, additional_special_tokens=added_tokens, **kwargs
)
r_output = tokenizer_r.encode("Hey this is a <special> token")
special_token_id = tokenizer_r.encode("<special>", add_special_tokens=False)[0]
self.assertTrue(special_token_id in r_output)
if self.test_slow_tokenizer:
tokenizer_cr = self.rust_tokenizer_class.from_pretrained(
pretrained_name,
additional_special_tokens=added_tokens,
**kwargs, # , from_slow=True <- unfortunately too slow to convert
)
tokenizer_p = self.tokenizer_class.from_pretrained(
pretrained_name, additional_special_tokens=added_tokens, **kwargs
)
p_output = tokenizer_p.encode("Hey this is a <special> token")
cr_output = tokenizer_cr.encode("Hey this is a <special> token")
self.assertEqual(p_output, r_output)
self.assertEqual(cr_output, r_output)
self.assertTrue(special_token_id in p_output)
self.assertTrue(special_token_id in cr_output)
@unittest.skip(
"encode_plus and batch_encode_plus are deprecated and __call__ do some processing, so we expect different results."
)
def test_call(self):
pass
def test_training_new_tokenizer(self):
# This feature only exists for fast tokenizers
if not self.test_rust_tokenizer:
return
tokenizer = self.get_rust_tokenizer()
new_tokenizer = tokenizer.train_new_from_iterator(SMALL_TRAINING_CORPUS, 100)
# Test we can use the new tokenizer with something not seen during training
inputs = new_tokenizer(["This is the first sentence", "This sentence is different 🤗."])
self.assertEqual(len(inputs["input_ids"]), 2)
decoded_input = new_tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)
expected_result = "This is the first sentence"
if tokenizer.backend_tokenizer.normalizer is not None:
expected_result = tokenizer.backend_tokenizer.normalizer.normalize_str(expected_result)
self.assertEqual(expected_result, decoded_input)
# We check that the parameters of the tokenizer remained the same
# Check we have the same number of added_tokens for both pair and non-pair inputs.
# make sure it has the same prefix tokens first
new_tokenizer.tgt_lang = tokenizer.tgt_lang
tokenizer.tgt_lang = tokenizer.tgt_lang
self.assertEqual(tokenizer.num_special_tokens_to_add(False), new_tokenizer.num_special_tokens_to_add(False))
self.assertEqual(tokenizer.num_special_tokens_to_add(True), new_tokenizer.num_special_tokens_to_add(True))
# Check we have the correct max_length for both pair and non-pair inputs.
self.assertEqual(tokenizer.max_len_single_sentence, new_tokenizer.max_len_single_sentence)
self.assertEqual(tokenizer.max_len_sentences_pair, new_tokenizer.max_len_sentences_pair)
# Assert the set of special tokens match as we didn't ask to change them
self.assertSequenceEqual(
tokenizer.all_special_tokens_extended,
new_tokenizer.all_special_tokens_extended,
)
self.assertDictEqual(tokenizer.special_tokens_map, new_tokenizer.special_tokens_map)
@unittest.skip("Fails because of the hack of adding <unk> in _tokenize")
def test_pickle_subword_regularization_tokenizer(self):
pass
@unittest.skip("Fails because of the hack of adding <unk> in _tokenize")
def test_subword_regularization_tokenizer(self):
pass
@require_torch
@require_sentencepiece
@require_tokenizers
class SeamlessM4TDistilledIntegrationTest(unittest.TestCase):
checkpoint_name = "facebook/hf-seamless-m4t-medium"
src_text = [
" UN Chief Says There Is No Military Solution in Syria",
""" Secretary-General Ban Ki-moon says his response to Russia's stepped up military support for Syria is that "there is no military solution" to the nearly five-year conflict and more weapons will only worsen the violence and misery for millions of people.""",
]
tgt_text = [
"Şeful ONU declară că nu există o soluţie militară în Siria",
"Secretarul General Ban Ki-moon declară că răspunsul său la intensificarea sprijinului militar al Rusiei"
' pentru Siria este că "nu există o soluţie militară" la conflictul de aproape cinci ani şi că noi arme nu vor'
" face decât să înrăutăţească violenţele şi mizeria pentru milioane de oameni.",
]
# fmt: off
expected_src_tokens = [256047, 16297, 134408, 8165, 248066, 14734, 950, 1135, 105721, 3573, 83, 27352, 108, 49486, 3]
# fmt: on
@classmethod
def setUpClass(cls):
cls.tokenizer: SeamlessM4TTokenizer = SeamlessM4TTokenizer.from_pretrained(
cls.checkpoint_name, src_lang="eng", tgt_lang="ron"
)
# cls.pad_token_id = 1
return cls
def test_language_codes(self):
self.assertEqual(self.tokenizer.convert_tokens_to_ids("__ace_Latn__"), 256002)
self.assertEqual(self.tokenizer.convert_tokens_to_ids("__shn__"), 256152)
self.assertEqual(self.tokenizer.convert_tokens_to_ids("__eng__"), 256047)
self.assertEqual(self.tokenizer.convert_tokens_to_ids("__fra__"), 256057)
self.assertEqual(self.tokenizer.convert_tokens_to_ids("__quy__"), 256144)
def test_tokenizer_tgt_lang(self):
ids = self.tokenizer(self.src_text, src_lang="fra").input_ids[0]
self.assertListEqual(self.expected_src_tokens[1:], ids[1 : len(self.expected_src_tokens)])
self.assertEqual(256057, ids[0])
rest_ids = ids[len(self.expected_src_tokens) :]
self.assertListEqual([0] * len(rest_ids), rest_ids)
ids = self.tokenizer(self.src_text, src_lang="__shn__").input_ids[0]
self.assertListEqual(self.expected_src_tokens[1:], ids[1 : len(self.expected_src_tokens)])
self.assertEqual(256152, ids[0])
# Copied from tests.models.nllb.test_tokenization_nllb.NllbDistilledIntegrationTest.test_enro_tokenizer_decode_ignores_language_codes
def test_enro_tokenizer_decode_ignores_language_codes(self):
self.assertIn(RO_CODE, self.tokenizer.all_special_ids)
# fmt: off
generated_ids = [RO_CODE, 4254, 98068, 112923, 39072, 3909, 713, 102767, 26, 17314, 35642, 14683, 33118, 2022, 66987, 2, 256047]
# fmt: on
result = self.tokenizer.decode(generated_ids, skip_special_tokens=True)
expected_romanian = self.tokenizer.decode(generated_ids[1:], skip_special_tokens=True)
self.assertEqual(result, expected_romanian)
self.assertNotIn(self.tokenizer.eos_token, result)
def test_enro_tokenizer_truncation(self):
src_text = ["this is gunna be a long sentence " * 20]
assert isinstance(src_text[0], str)
desired_max_length = 10
ids = self.tokenizer(src_text, max_length=desired_max_length, truncation=True).input_ids[0]
self.assertEqual(ids[-1], 3)
self.assertEqual(ids[0], EN_CODE)
self.assertEqual(len(ids), desired_max_length)
# Copied from tests.models.nllb.test_tokenization_nllb.NllbDistilledIntegrationTest.test_special_tokens_unaffacted_by_save_load with fairseq_tokens_to_ids->additional_special_tokens, Nllb->SeamlessM4T, Dict->List
def test_special_tokens_unaffacted_by_save_load(self):
tmpdirname = tempfile.mkdtemp()
original_special_tokens = self.tokenizer.additional_special_tokens
self.tokenizer.save_pretrained(tmpdirname)
new_tok = SeamlessM4TTokenizer.from_pretrained(tmpdirname)
self.assertListEqual(new_tok.additional_special_tokens, original_special_tokens)
@require_torch
def test_enro_tokenizer_prepare_batch(self):
batch = self.tokenizer(
self.src_text,
text_target=self.tgt_text,
padding=True,
truncation=True,
max_length=len(self.expected_src_tokens),
pad_to_multiple_of=None,
return_tensors="pt",
)
batch["decoder_input_ids"] = shift_tokens_right(
batch["labels"], self.tokenizer.pad_token_id, self.tokenizer.convert_tokens_to_ids("__ron__")
)
self.assertIsInstance(batch, BatchEncoding)
self.assertEqual((2, 15), batch.input_ids.shape)
self.assertEqual((2, 15), batch.attention_mask.shape)
result = batch.input_ids.tolist()[0]
self.assertListEqual(self.expected_src_tokens, result)
self.assertEqual(RO_CODE, batch.decoder_input_ids[0, 0]) # EOS
# Test that special tokens are reset
self.assertEqual(self.tokenizer.prefix_tokens, [EN_CODE])
self.assertEqual(self.tokenizer.suffix_tokens, [self.tokenizer.eos_token_id])
def test_seq2seq_max_length(self):
batch = self.tokenizer(
self.src_text, padding=True, truncation=True, max_length=3, return_tensors="pt", pad_to_multiple_of=None
)
targets = self.tokenizer(
text_target=self.tgt_text, padding=True, truncation=True, max_length=10, return_tensors="pt"
)
labels = targets["input_ids"]
batch["decoder_input_ids"] = shift_tokens_right(
labels,
self.tokenizer.pad_token_id,
decoder_start_token_id=self.tokenizer.convert_tokens_to_ids(self.tokenizer.tgt_lang),
)
self.assertEqual(batch.input_ids.shape[1], 3)
self.assertEqual(batch.decoder_input_ids.shape[1], 10)
@require_torch
def test_tokenizer_translation(self):
inputs = self.tokenizer._build_translation_inputs(
"A test", return_tensors="pt", src_lang="eng", tgt_lang="fra"
)
self.assertEqual(
nested_simplify(inputs),
{
# A, test, EOS, en_XX
"input_ids": [[256047, 70, 7356, 3]],
"attention_mask": [[1, 1, 1, 1]],
# ar_AR
"forced_bos_token_id": 256057,
},
)
@require_sentencepiece
@require_tokenizers
class CommonSpmIntegrationTests(unittest.TestCase):
"""
A class that regroups important test to make sure that we properly handle the special tokens.
"""
@classmethod
def setUpClass(cls):
tokenizer = SeamlessM4TTokenizer(SAMPLE_VOCAB, extra_ids=0, add_bos_token=False, legacy=False)
tokenizer.add_special_tokens({"additional_special_tokens": [AddedToken("<s>", rstrip=False, lstrip=False)]})
cls.tokenizer = tokenizer
return cls
def test_add_dummy_prefix(self):
# make sure `'▁'` is prepended, and outputs match sp_model's
# `sentencepiece.NormalizerSpec.add_dummy_prefix` attribute
input_ids = self.tokenizer.encode(". Hello")
self.assertEqual(input_ids, [3, 1, 8, 5, 157, 87, 21, 3])
sp_encode = self.tokenizer.sp_model.encode(". Hello")
# [bos, lang_id, _] + offset_sp_encode
self.assertEqual(input_ids[:-1], [3, 1, 8] + [i + self.tokenizer.fairseq_offset for i in sp_encode])
tokens = self.tokenizer.tokenize(". Hello")
self.assertEqual(tokens, ["▁", ".", "▁He", "ll", "o"])
tokens = self.tokenizer.tokenize("")
self.assertEqual(tokens, [])
self.assertEqual(tokens, self.tokenizer.sp_model.encode("", out_type=str))
tokens = self.tokenizer.tokenize(" ")
self.assertEqual(tokens, [])
self.assertEqual(tokens, self.tokenizer.sp_model.encode(" ", out_type=str))
tokens = self.tokenizer.tokenize("▁")
self.assertEqual(tokens, [])
self.assertEqual(tokens, self.tokenizer.sp_model.encode("▁", out_type=str))
def test_remove_extra_whitespaces(self):
# make sure the extra spaces are eaten. Since the sample vocab does not have
# `______`. sentencepiece.NormalizerSpec.remove_extra_whitespaces attribute is set to False
input_ids = self.tokenizer.encode(" . Hello")
self.assertEqual(input_ids, [3, 1, 8, 5, 157, 87, 21, 3])
sp_encode = self.tokenizer.sp_model.encode(" . Hello")
self.assertEqual([i - self.tokenizer.fairseq_offset for i in input_ids[2:-1]], [7] + sp_encode)
tokens = self.tokenizer.tokenize(" . Hello")
self.assertEqual(tokens, ["▁", ".", "▁He", "ll", "o"])
# `'▁'` is also a whitespace
input_ids = self.tokenizer.encode("▁He is not")
self.assertEqual(input_ids, [3, 1, 157, 47, 45, 3])
tokens = self.tokenizer.tokenize("▁He is not")
sp_encode = [
self.tokenizer.sp_model.piece_to_id("▁He"),
self.tokenizer.sp_model.piece_to_id("▁is"),
self.tokenizer.sp_model.piece_to_id("▁not"),
]
self.assertEqual([i - self.tokenizer.fairseq_offset for i in input_ids[2:-1]], sp_encode)
self.assertEqual(tokens, ["▁He", "▁is", "▁not"]) # no extra space added
input_ids = self.tokenizer.encode("▁He is not<s> ▁He")
self.assertEqual(input_ids, [3, 1, 157, 47, 45, 2, 157, 3])
tokens = self.tokenizer.tokenize("▁He is not<s> ▁He")
self.assertEqual(tokens, ["▁He", "▁is", "▁not", "<s>", "▁He"]) # spaces are eaten by spm + our strip
# make sure that the output after the extra id is the same as if
# extra_id was not there
input_ids = self.tokenizer.encode("▁He is not ▁He")
self.assertEqual(input_ids, [3, 1, 157, 47, 45, 157, 3])
tokens = self.tokenizer.tokenize("▁He is not ▁He")
self.assertEqual(tokens, ["▁He", "▁is", "▁not", "▁He"]) # spaces are eaten by spm even if not start
def test_character_after_special_token(self):
# Make sure that `tokenizer.tokenize` is similar to
# adding the equivalent special token to the vocab
input_ids = self.tokenizer.encode("Hey <s>I")
self.assertEqual(input_ids, [3, 1, 157, 31, 2, 101, 3])
sp_encode = self.tokenizer.sp_model.encode("Hey .I")
# the last token besides eos should be 100 offset
self.assertEqual(input_ids[-2] - self.tokenizer.fairseq_offset, sp_encode[-1])
tokens = self.tokenizer.tokenize("<s>I")
self.assertEqual(tokens, ["<s>", "I"])
input_ids = self.tokenizer.encode("Hello, <s>,")
self.assertEqual(input_ids, [3, 1, 157, 87, 21, 4, 2, 4, 3])
tokens = self.tokenizer.tokenize("Hello, <s>,")
self.assertEqual(tokens, ["▁He", "ll", "o", ",", "<s>", ","])
def test_special_tokens_strip(self):
input_ids = self.tokenizer.encode(" <s> ,")
self.assertEqual(input_ids, [3, 1, 2, 8, 4, 3])
tokens = self.tokenizer.tokenize(" <s> ,")
# spaces are eaten by rstrip / lstrip + spm sp_model.encode(" ") = []
self.assertEqual(tokens, ["<s>", "▁", ","])
input_ids = self.tokenizer.encode("No <s> ▁He")
self.assertEqual(input_ids, [3, 1, 285, 2, 157, 3])
tokens = self.tokenizer.tokenize("No <s> ▁He")
self.assertEqual(tokens, ["▁No", "<s>", "▁He"]) # spaces are eaten by rstrip / lstrip
...@@ -84,6 +84,18 @@ SPECIAL_CASES_TO_ALLOW = { ...@@ -84,6 +84,18 @@ SPECIAL_CASES_TO_ALLOW = {
"ClapAudioConfig": ["num_classes"], "ClapAudioConfig": ["num_classes"],
# Not used, but providing useful information to users # Not used, but providing useful information to users
"SpeechT5HifiGanConfig": ["sampling_rate"], "SpeechT5HifiGanConfig": ["sampling_rate"],
# Actually used in the config or generation config, in that case necessary for the sub-components generation
"SeamlessM4TConfig": [
"max_new_tokens",
"t2u_max_new_tokens",
"t2u_decoder_attention_heads",
"t2u_decoder_ffn_dim",
"t2u_decoder_layers",
"t2u_encoder_attention_heads",
"t2u_encoder_ffn_dim",
"t2u_encoder_layers",
"t2u_max_position_embeddings",
],
} }
......
...@@ -463,6 +463,7 @@ OBJECTS_TO_IGNORE = [ ...@@ -463,6 +463,7 @@ OBJECTS_TO_IGNORE = [
"SEWForCTC", "SEWForCTC",
"SamConfig", "SamConfig",
"SamPromptEncoderConfig", "SamPromptEncoderConfig",
"SeamlessM4TConfig", # use of unconventional markdown
"Seq2SeqTrainingArguments", "Seq2SeqTrainingArguments",
"SpecialTokensMixin", "SpecialTokensMixin",
"Speech2Text2Config", "Speech2Text2Config",
......
...@@ -112,6 +112,9 @@ IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [ ...@@ -112,6 +112,9 @@ IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [
"BridgeTowerVisionModel", # No need to test it as it is tested by BridgeTowerModel model. "BridgeTowerVisionModel", # No need to test it as it is tested by BridgeTowerModel model.
"BarkCausalModel", # Building part of bigger (tested) model. "BarkCausalModel", # Building part of bigger (tested) model.
"BarkModel", # Does not have a forward signature - generation tested with integration tests "BarkModel", # Does not have a forward signature - generation tested with integration tests
"SeamlessM4TTextToUnitModel", # Building part of bigger (tested) model.
"SeamlessM4TCodeHifiGan", # Building part of bigger (tested) model.
"SeamlessM4TTextToUnitForConditionalGeneration", # Building part of bigger (tested) model.
] ]
# Update this list with test files that don't have a tester with a `all_model_classes` variable and which don't # Update this list with test files that don't have a tester with a `all_model_classes` variable and which don't
...@@ -281,6 +284,10 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [ ...@@ -281,6 +284,10 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
"SpeechT5ForTextToSpeech", "SpeechT5ForTextToSpeech",
"SpeechT5HifiGan", "SpeechT5HifiGan",
"VitMatteForImageMatting", "VitMatteForImageMatting",
"SeamlessM4TTextToUnitModel",
"SeamlessM4TTextToUnitForConditionalGeneration",
"SeamlessM4TCodeHifiGan",
"SeamlessM4TForSpeechToSpeech", # no auto class for speech-to-speech
] ]
# DO NOT edit this list! # DO NOT edit this list!
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment