"Mid-atlantic united states english,philadelphia, pennsylvania, united states english,united states english,philadelphia style united states english":"American",
"Mid-atlantic,england english,united states english":"American",
"Midatlantic,england english":"American",
"Midwestern states (michigan),united states english":"American",
"Mild northern england english":"English",
"Minor french accent":"French",
"Mix of american and british ,native polish":"Polish",
"Mix of american and british accent":"Unknown",# Combination not clearly mapped
"Mostly american with some british and australian inflections":"Unknown",# Combination not clearly mapped
"My accent is influenced by the phones of all letters within a sentence.,southern african (south africa, zimbabwe, namibia)":"South african",
"New zealand english":"New Zealand English",
"Nigeria english":"Nigerian",# Note: added new
"Non native speaker from france":"French",
"Non-native":"Unknown",# Too vague
"Non-native,german accent":"German",
"North european english":"Unknown",# Too broad
"Norwegian":"Norwegian",# Note: added new
"Ontario,canadian english":"Canadian",# Note: added new
"Polish english":"Polish",
"Rhode island new england accent":"American",
"Singaporean english":"Singaporean",# Note: added new
"Slavic":"Eastern european",
"Slighty southern affected by decades in the midwest, 4 years in spain and germany, speak some german, spanish, polish. have lived in nine states.":"Unknown",# Complex blend
"South african":"South african",
"South atlantic (falkland islands, saint helena)":"Unknown",# Specific regions not listed
"South australia":"Australian",
"South indian":"Indian",
"Southern drawl":"American",
"Southern texas accent,united states english":"American",
"Southern united states,united states english":"American",
"Spanish bilingual":"Spanish",
"Spanish,foreign,non-native":"Spanish",
"Strong latvian accent":"Latvian",
"Swedish accent":"Swedish",# Note: added new
"Transnational englishes blend":"Unknown",# Too vague
"U.k. english":"English",
"Very slight russian accent,standard american english,boston influence":"American",
"Welsh english":"Welsh",
"West african":"Unknown",# No specific West African category
"West indian":"Unknown",# Caribbean, but no specific match
"Western europe":"Unknown",# Too broad
"With heavy cantonese accent":"Chinese",
}
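# Illustrative sketch (not part of the original script): how the mapping above might be used to
# normalise a raw accent string, falling back to "Unknown" for unmapped values. The dictionary's
# actual variable name is defined above this excerpt, so `accent_mapping` is an assumed name here,
# as is the use of `.capitalize()` to match the key style.
#
#   accent = accent_mapping.get(raw_accent.capitalize(), "Unknown")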
def preprocess_labels(label: str) -> str:
    """Apply pre-processing formatting to the accent labels"""
    if "_" in label:
        # voxpopuli stylises the accent as a language code (e.g. en_pl for "polish") - convert to the full accent name
        language_code = label.split("_")[-1]
        label = LANGUAGES[language_code]
    # VCTK labels for two words are concatenated into one (NewZealand -> New Zealand)
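    # NOTE: the remainder of this function is not shown in this excerpt. The two lines below are an
    # illustrative sketch only (an assumption, not the original code): one plausible way to split the
    # concatenated VCTK labels is to insert a space before each internal capital letter.
    # Assumes `import re` at module level.
    label = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)
    return label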
@dataclass
class DataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    Using `HfArgumentParser` we can turn this class
    into argparse arguments to be able to specify them on
    the command line.
    """
    train_dataset_name: str = field(
        default=None,
        metadata={
            "help": "The name of the training dataset to use (via the datasets library). Load and combine "
            "multiple datasets by separating dataset ids by a '+' symbol. For example, to load and combine "
            "librispeech and common voice, set `train_dataset_name='librispeech_asr+common_voice'`."
        },
    )
    train_dataset_config_name: Optional[str] = field(
        default=None,
        metadata={
            "help": "The configuration name of the training dataset to use (via the datasets library). Load and combine "
            "multiple datasets by separating dataset configs by a '+' symbol."
        },
    )
    train_split_name: str = field(
        default="train",
        metadata={
            "help": ("The name of the training data set split to use (via the datasets library). Defaults to 'train'")
        },
    )
    train_dataset_samples: str = field(
        default=None,
        metadata={
            "help": "Number of samples in the training data. Load and combine "
            "multiple datasets by separating dataset samples by a '+' symbol."
        },
    )
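    # Illustrative example (an assumption, not from the original script): when combining datasets,
    # the '+'-separated entries presumably have to line up across the three fields above, e.g.
    #   --train_dataset_name "librispeech_asr+common_voice"
    #   --train_dataset_config_name "clean+en"
    #   --train_dataset_samples "100+200"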
    eval_dataset_name: str = field(
        default=None,
        metadata={
            "help": "The name of the evaluation dataset to use (via the datasets library). Defaults to the training dataset name if unspecified."
        },
    )
    eval_dataset_config_name: Optional[str] = field(
        default=None,
        metadata={
            "help": "The configuration name of the evaluation dataset to use (via the datasets library). Defaults to the training dataset config name if unspecified."
        },
    )
    eval_split_name: str = field(
        default="validation",
        metadata={
            "help": (
                "The name of the evaluation data set split to use (via the datasets"
                " library). Defaults to 'validation'"
            )
        },
    )
    audio_column_name: str = field(
        default="audio",
        metadata={"help": "The name of the dataset column containing the audio data. Defaults to 'audio'"},
    )
    train_label_column_name: str = field(
        default="labels",
        metadata={
            "help": "The name of the dataset column containing the labels in the train set. Defaults to 'labels'"
        },
    )
    eval_label_column_name: str = field(
        default="labels",
        metadata={"help": "The name of the dataset column containing the labels in the eval set. Defaults to 'labels'"},
    )
    max_train_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of training examples to this "
                "value if set."
            )
        },
    )
    max_eval_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
                "value if set."
            )
        },
    )
    max_length_seconds: Optional[float] = field(
        default=20,
        metadata={"help": "Audio samples will be randomly cut to this length during training if the value is set."},
    )
    min_length_seconds: Optional[float] = field(
        default=5,
        metadata={"help": "Audio samples shorter than this value will be filtered out during training if the value is set."},
    )
    preprocessing_num_workers: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the preprocessing."},
    )
    filter_threshold: Optional[float] = field(
        default=1.0,
        metadata={"help": "Filter out labels that occur in less than `filter_threshold` percent of the training/eval data."},
    )
    max_samples_per_label: Optional[int] = field(
        default=None,
        metadata={"help": "If set, randomly limits the number of samples per label."},
    )
    save_to_disk: str = field(
        default=None,
        metadata={
            "help": "If set, will save the dataset to this path if this is an empty folder. If not empty, will load the datasets from it."
        },
    )
    temporary_save_to_disk: str = field(
        default=None,
        metadata={"help": "Temporarily save audio labels here."},
    )
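
# Illustrative usage (sketch, not part of the original script): `HfArgumentParser` turns the
# dataclasses in this file into command-line arguments, e.g.
#
#   parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
#   model_args, data_args, training_args = parser.parse_args_into_dataclasses()
#
# so every field above becomes a flag such as `--train_dataset_name` or `--max_length_seconds`.
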
@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """
    model_name_or_path: str = field(
        default="facebook/wav2vec2-base",
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models. Only works with Wav2Vec2 models."},
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from the Hub"}
    )
    model_revision: str = field(
        default="main",
        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
    )
    feature_extractor_name: Optional[str] = field(
        default=None, metadata={"help": "Name or path of preprocessor config."}
    )
    freeze_feature_encoder: bool = field(
        default=False,
        metadata={
            "help": "Whether to freeze the feature encoder layers of the model. Only relevant for Wav2Vec2-style models."
        },
    )
    freeze_base_model: bool = field(
        default=True, metadata={"help": "Whether to freeze the base encoder of the model."}
    )
    attention_mask: bool = field(
        default=True, metadata={"help": "Whether to generate an attention mask in the feature extractor."}
    )
    token: str = field(
        default=None,
        metadata={
            "help": (
                "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token "
                "generated when running `huggingface-cli login` (stored in `~/.huggingface`)."
            )
        },
    )
    trust_remote_code: bool = field(
        default=False,
        metadata={
            "help": (
                "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option "
                "should only be set to `True` for repositories you trust and in which you have read the code, as it will "
                "execute code present on the Hub on your local machine."
            )
        },
    )
    ignore_mismatched_sizes: bool = field(
        default=True,
        metadata={"help": "Will enable loading a pretrained model whose head dimensions are different."},
    )
    attention_dropout: float = field(
        default=0.0, metadata={"help": "The dropout ratio for the attention probabilities."}
    )
    activation_dropout: float = field(
        default=0.0, metadata={"help": "The dropout ratio for activations inside the fully connected layer."}
    )
    feat_proj_dropout: float = field(default=0.0, metadata={"help": "The dropout ratio for the projected features."})
    hidden_dropout: float = field(
        default=0.0,
        metadata={
            "help": "The dropout probability for all fully connected layers in the embeddings, encoder, and pooler."
        },
    )
    final_dropout: float = field(
        default=0.0,
        metadata={"help": "The dropout probability for the final projection layer."},
    )
    mask_time_prob: float = field(
        default=0.05,
        metadata={
            "help": (
                "Probability of each feature vector along the time axis to be chosen as the start of the vector "
                "span to be masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature "
                "vectors will be masked along the time axis."
            )
        },
    )
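    # For example (illustrative numbers, not from the original source): with mask_time_prob=0.05,
    # mask_time_length=10 and a 1000-frame sequence, roughly 0.05 * 1000 // 10 = 5 span starts are
    # sampled, so on the order of 5 * 10 = 50 time steps end up masked.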
    mask_time_length: int = field(
        default=10,
        metadata={"help": "Length of vector span to mask along the time axis."},
    )
    mask_feature_prob: float = field(
        default=0.0,
        metadata={
            "help": (
                "Probability of each feature vector along the feature axis to be chosen as the start of the vector span"
                " to be masked. Approximately ``mask_feature_prob * sequence_length // mask_feature_length`` feature"
                " bins will be masked along the feature axis."
            )
        },
    )
    mask_feature_length: int = field(
        default=10,
        metadata={"help": "Length of vector span to mask along the feature axis."},
    )
        remove_columns=[col for col in next(iter(raw_datasets.values())).column_names if col != "labels"],  # this is a trick to avoid rewriting the entire audio column, which takes ages
        logger.info(f"Dataset saved at {data_args.save_to_disk}. Be careful: changing the data parameters won't change the already saved dataset if it is reloaded.")
PROMPT="""You will be given six descriptive keywords related to an audio sample of a person's speech. These keywords include:
1. The gender (e.g., male, female)
2. The level of reverberation (e.g., very roomy sounding, quite roomy sounding, slightly roomy sounding, moderate reverberation, slightly confined sounding, quite confined sounding, very confined sounding)
3. The amount of noise the sample (e.g., very noisy, quite noisy, slightly noisy, moderate ambient sound, slightly clear, quite clear, very clear)
4. The tone of the speaker's voice (e.g., very monotone, quite monotone, slightly monotone, moderate intonation, slightly expressive, quite expressive, very expressive)
5. The pace of the speaker's delivery (e.g., very slowly, quite slowly, slightly slowly, moderate speed, slightly fast, quite fast, very fast)
6. The pitch of the speaker's voice (e.g., very low pitch, quite low pitch, slightly low pitch, moderate pitch, slightly high pitch, quite high pitch, very high pitch)
Your task is to create a text description using these keywords that accurately describes the speech sample while ensuring the description remains grammatically correct and easy to understand. You should rearrange the keyword order as necessary, and substitute synonymous terms where appropriate. If the amount of noise is 'very noisy' and the level of reverberation is 'very roomy sounding', include terms like 'very bad recording' in the description. Likewise, if the amount of noise is 'very clear' and the level of reverberation is 'very confined sounding', include terms like 'very good recording' in the description. Otherwise, do not add extra details beyond what has been provided, and only return the generated description.
For example, given the following keywords: 'female', 'slightly roomy sounding', 'slightly noisy', 'very expressive', 'slightly low pitch', 'very slowly', a valid description would be: 'a woman with a deep voice speaks slowly but has an animated delivery in an echoey room with some background noise'.
For the keywords: '[gender]', '[reverberation]', '[noise]', '[speech_monotony]', '[pitch]', '[speaking_rate]', the corresponding description is:"
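

# Illustrative sketch (not part of the original script): one simple way to fill the PROMPT
# placeholders before sending the text to a description-generation model. The helper name and
# the use of plain `str.replace` are assumptions made for this example.
def build_description_prompt(keywords: dict) -> str:
    prompt = PROMPT
    for placeholder, value in keywords.items():
        prompt = prompt.replace(f"[{placeholder}]", value)
    return prompt


# Example usage with the keyword categories listed in the prompt:
# build_description_prompt({
#     "gender": "female",
#     "reverberation": "slightly roomy sounding",
#     "noise": "slightly noisy",
#     "speech_monotony": "very expressive",
#     "pitch": "slightly low pitch",
#     "speaking_rate": "very slowly",
# })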