"docs/source/vscode:/vscode.git/clone" did not exist on "5db9abde439bc02c3791da2a4fefee80d94d5b96"
Unverified Commit 42137280 authored by Anton Lozhkov's avatar Anton Lozhkov Committed by GitHub
Browse files

[Examples] Add an official audio classification example (#13722)



* Restore broken merge

* Additional args, DDP, remove CommonLanguage

* Update examples for V100, add training results

* Style

* Apply suggestions from code review
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Remove custom datasets for simplicity, apply suggestions from code review

* Add the attention_mask flag, reorganize README
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent c4113721
<!---
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Audio classification examples
The following examples showcase how to fine-tune `Wav2Vec2` for audio classification using PyTorch.
Speech recognition models that have been pretrained in unsupervised fashion on audio data alone,
*e.g.* [Wav2Vec2](https://huggingface.co/transformers/master/model_doc/wav2vec2.html),
[HuBERT](https://huggingface.co/transformers/master/model_doc/hubert.html),
[XLSR-Wav2Vec2](https://huggingface.co/transformers/master/model_doc/xlsr_wav2vec2.html), have shown to require only
very little annotated data to yield good performance on speech classification datasets.
## Single-GPU
The following command shows how to fine-tune [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) on the 🗣️ [Keyword Spotting subset](https://huggingface.co/datasets/superb#ks) of the SUPERB dataset.
```bash
python run_audio_classification.py \
--model_name_or_path facebook/wav2vec2-base \
--dataset_name superb \
--dataset_config_name ks \
--output_dir wav2vec2-base-keyword-spotting \
--overwrite_output_dir \
--remove_unused_columns False \
--do_train \
--do_eval \
--fp16 \
--learning_rate 3e-5 \
--max_length_seconds 1 \
--warmup_ratio 0.1 \
--num_train_epochs 5 \
--per_device_train_batch_size 32 \
--gradient_accumulation_steps 4 \
--per_device_eval_batch_size 32 \
--dataloader_num_workers 4 \
--logging_strategy steps \
--logging_steps 10 \
--evaluation_strategy epoch \
--save_strategy epoch \
--load_best_model_at_end True \
--save_total_limit 3 \
--seed 0 \
--push_to_hub
```
On a single V100 GPU (16GB), this script should run in ~10 minutes and yield accuracy of **98.4%**.
👀 See the results here: [anton-l/wav2vec2-base-keyword-spotting](https://huggingface.co/anton-l/wav2vec2-base-keyword-spotting)
## Multi-GPU
The following command shows how to fine-tune [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) for 🌎 **Language Identification** on the [CommonLanguage dataset](https://huggingface.co/datasets/anton-l/common_language).
```bash
python run_audio_classification.py \
--model_name_or_path facebook/wav2vec2-base \
--dataset_name anton-l/common_language \
--audio_column_name path \
--label_column_name language \
--output_dir wav2vec2-base-lang-id \
--overwrite_output_dir \
--remove_unused_columns False \
--do_train \
--do_eval \
--fp16 \
--learning_rate 3e-5 \
--max_length_seconds 16 \
--attention_mask False \
--warmup_ratio 0.1 \
--num_train_epochs 10 \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 4 \
--per_device_eval_batch_size 1 \
--dataloader_num_workers 8 \
--logging_strategy steps \
--logging_steps 10 \
--evaluation_strategy epoch \
--save_strategy epoch \
--load_best_model_at_end True \
--save_total_limit 3 \
--seed 0 \
--push_to_hub
```
On 4 V100 GPUs (16GB), this script should run in ~1 hour and yield accuracy of **79.45%**.
👀 See the results here: [anton-l/wav2vec2-base-lang-id](https://huggingface.co/anton-l/wav2vec2-base-lang-id)
## Sharing your model on 🤗 Hub
0. If you haven't already, [sign up](https://huggingface.co/join) for a 🤗 account
1. Make sure you have `git-lfs` installed and git set up.
```bash
$ apt install git-lfs
```
2. Log in with your HuggingFace account credentials using `huggingface-cli`
```bash
$ huggingface-cli login
# ...follow the prompts
```
3. When running the script, pass the following arguments:
```bash
python run_audio_classification.py \
--push_to_hub \
--hub_model_id <username/model_id> \
...
```
\ No newline at end of file
datasets>=1.12.0
torchaudio
torch>=1.6
\ No newline at end of file
#!/usr/bin/env python
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
import logging
import os
import sys
from dataclasses import dataclass, field
from random import randint
from typing import Optional
import datasets
import numpy as np
import torchaudio
from datasets import DatasetDict, load_dataset
import transformers
from transformers import (
AutoConfig,
AutoFeatureExtractor,
AutoModelForAudioClassification,
HfArgumentParser,
Trainer,
TrainingArguments,
set_seed,
)
from transformers.trainer_utils import get_last_checkpoint
from transformers.utils import check_min_version
from transformers.utils.versions import require_version
logger = logging.getLogger(__name__)
# Will error if the minimal version of Transformers is not installed. Remove at your own risks.
check_min_version("4.11.0.dev0")
require_version("datasets>=1.12.1", "To fix: pip install -r examples/pytorch/audio-classification/requirements.txt")
def load_audio(path: str, sample_rate: int = 16000):
wav, sr = torchaudio.load(path)
# convert multi-channel audio to mono
wav = wav.mean(0)
# standardize sample rate if it varies in the dataset
resampler = torchaudio.transforms.Resample(sr, sample_rate)
wav = resampler(wav)
return wav
def random_subsample(wav: np.ndarray, max_length: float, sample_rate: int = 16000):
"""Randomly sample chunks of `max_length` seconds from the input audio"""
sample_length = int(round(sample_rate * max_length))
if len(wav) <= sample_length:
return wav
random_offset = randint(0, len(wav) - sample_length - 1)
return wav[random_offset : random_offset + sample_length]
@dataclass
class DataTrainingArguments:
"""
Arguments pertaining to what data we are going to input our model for training and eval.
Using `HfArgumentParser` we can turn this class
into argparse arguments to be able to specify them on
the command line.
"""
dataset_name: Optional[str] = field(default=None, metadata={"help": "Name of a dataset from the datasets package"})
dataset_config_name: Optional[str] = field(
default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
)
train_file: Optional[str] = field(
default=None, metadata={"help": "A file containing the training audio paths and labels."}
)
eval_file: Optional[str] = field(
default=None, metadata={"help": "A file containing the validation audio paths and labels."}
)
train_split_name: Optional[str] = field(
default="train",
metadata={
"help": "The name of the training data set split to use (via the datasets library). Defaults to 'train'"
},
)
eval_split_name: Optional[str] = field(
default="validation",
metadata={
"help": "The name of the training data set split to use (via the datasets library). Defaults to "
"'validation'"
},
)
audio_column_name: Optional[str] = field(
default="file",
metadata={"help": "The name of the dataset column containing the audio data. Defaults to 'file'"},
)
label_column_name: Optional[str] = field(
default="label", metadata={"help": "The name of the dataset column containing the labels. Defaults to 'label'"}
)
max_train_samples: Optional[int] = field(
default=None,
metadata={
"help": "For debugging purposes or quicker training, truncate the number of training examples to this "
"value if set."
},
)
max_eval_samples: Optional[int] = field(
default=None,
metadata={
"help": "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
"value if set."
},
)
max_length_seconds: Optional[float] = field(
default=20,
metadata={"help": "Audio clips will be randomly cut to this length during training if the value is set."},
)
@dataclass
class ModelArguments:
"""
Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
"""
model_name_or_path: str = field(
default="facebook/wav2vec2-base",
metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"},
)
config_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from the Hub"}
)
model_revision: str = field(
default="main",
metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
)
feature_extractor_name: str = field(default=None, metadata={"help": "Name or path of preprocessor config."})
freeze_feature_extractor: Optional[bool] = field(
default=True, metadata={"help": "Whether to freeze the feature extractor layers of the model."}
)
attention_mask: Optional[bool] = field(
default=True, metadata={"help": "Whether to generate an attention mask in the feature extractor."}
)
use_auth_token: bool = field(
default=False,
metadata={
"help": "Will use the token generated when running `transformers-cli login` (necessary to use this script "
"with private models)."
},
)
def main():
# See all possible arguments in src/transformers/training_args.py
# or by passing the --help flag to this script.
# We now keep distinct sets of args, for a cleaner separation of concerns.
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
# If we pass only one argument to the script and it's the path to a json file,
# let's parse it to get our arguments.
model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
else:
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
handlers=[logging.StreamHandler(sys.stdout)],
)
log_level = training_args.get_process_log_level()
logger.setLevel(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
# Log on each process the small summary:
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu} "
+ f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
)
logger.info(f"Training/evaluation parameters {training_args}")
# Set seed before initializing model.
set_seed(training_args.seed)
# Detecting last checkpoint.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
last_checkpoint = get_last_checkpoint(training_args.output_dir)
if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty. "
"Use --overwrite_output_dir to train from scratch."
)
elif last_checkpoint is not None and training_args.resume_from_checkpoint is None:
logger.info(
f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
"the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
)
# Initialize our dataset and prepare it for the audio classification task.
raw_datasets = DatasetDict()
raw_datasets["train"] = load_dataset(
data_args.dataset_name, data_args.dataset_config_name, split=data_args.train_split_name
)
raw_datasets["eval"] = load_dataset(
data_args.dataset_name, data_args.dataset_config_name, split=data_args.eval_split_name
)
if data_args.audio_column_name not in raw_datasets["train"].column_names:
raise ValueError(
f"--audio_column_name {data_args.audio_column_name} not found in dataset '{data_args.dataset_name}'. "
"Make sure to set `--audio_column_name` to the correct audio column - one of "
f"{', '.join(raw_datasets['train'].column_names)}."
)
if data_args.label_column_name not in raw_datasets["train"].column_names:
raise ValueError(
f"--label_column_name {data_args.label_column_name} not found in dataset '{data_args.dataset_name}'. "
"Make sure to set `--label_column_name` to the correct text column - one of "
f"{', '.join(raw_datasets['train'].column_names)}."
)
# Setting `return_attention_mask=True` is the way to get a correctly masked mean-pooling over
# transformer outputs in the classifier, but it doesn't always lead to better accuracy
feature_extractor = AutoFeatureExtractor.from_pretrained(
model_args.feature_extractor_name or model_args.model_name_or_path,
return_attention_mask=model_args.attention_mask,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
def train_transforms(batch):
"""Apply train_transforms across a batch."""
output_batch = {"input_values": []}
for f in batch[data_args.audio_column_name]:
wav = load_audio(f, sample_rate=feature_extractor.sampling_rate)
wav = random_subsample(
wav, max_length=data_args.max_length_seconds, sample_rate=feature_extractor.sampling_rate
)
output_batch["input_values"].append(wav)
output_batch["labels"] = [label for label in batch[data_args.label_column_name]]
return output_batch
def val_transforms(batch):
"""Apply val_transforms across a batch."""
output_batch = {"input_values": []}
for f in batch[data_args.audio_column_name]:
wav = load_audio(f, sample_rate=feature_extractor.sampling_rate)
output_batch["input_values"].append(wav)
output_batch["labels"] = [label for label in batch[data_args.label_column_name]]
return output_batch
# Prepare label mappings.
# We'll include these in the model's config to get human readable labels in the Inference API.
labels = raw_datasets["train"].features[data_args.label_column_name].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
label2id[label] = str(i)
id2label[str(i)] = label
# Load the accuracy metric from the datasets package
metric = datasets.load_metric("accuracy")
# Define our compute_metrics function. It takes an `EvalPrediction` object (a namedtuple with
# `predictions` and `label_ids` fields) and has to return a dictionary string to float.
def compute_metrics(eval_pred):
"""Computes accuracy on a batch of predictions"""
predictions = np.argmax(eval_pred.predictions, axis=1)
return metric.compute(predictions=predictions, references=eval_pred.label_ids)
config = AutoConfig.from_pretrained(
model_args.config_name or model_args.model_name_or_path,
num_labels=len(labels),
label2id=label2id,
id2label=id2label,
finetuning_task="audio-classification",
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
model = AutoModelForAudioClassification.from_pretrained(
model_args.model_name_or_path,
from_tf=bool(".ckpt" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
# freeze the convolutional waveform encoder
if model_args.freeze_feature_extractor:
model.freeze_feature_extractor()
if training_args.do_train:
if "train" not in raw_datasets:
raise ValueError("--do_train requires a train dataset")
if data_args.max_train_samples is not None:
raw_datasets["train"] = (
raw_datasets["train"].shuffle(seed=training_args.seed).select(range(data_args.max_train_samples))
)
# Set the training transforms
raw_datasets["train"].set_transform(train_transforms, output_all_columns=False)
if training_args.do_eval:
if "eval" not in raw_datasets:
raise ValueError("--do_eval requires a validation dataset")
if data_args.max_eval_samples is not None:
raw_datasets["eval"] = (
raw_datasets["eval"].shuffle(seed=training_args.seed).select(range(data_args.max_eval_samples))
)
# Set the validation transforms
raw_datasets["eval"].set_transform(val_transforms, output_all_columns=False)
# Initialize our trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=raw_datasets["train"] if training_args.do_train else None,
eval_dataset=raw_datasets["eval"] if training_args.do_eval else None,
compute_metrics=compute_metrics,
tokenizer=feature_extractor,
)
# Training
if training_args.do_train:
checkpoint = None
if training_args.resume_from_checkpoint is not None:
checkpoint = training_args.resume_from_checkpoint
elif last_checkpoint is not None:
checkpoint = last_checkpoint
train_result = trainer.train(resume_from_checkpoint=checkpoint)
trainer.save_model()
trainer.log_metrics("train", train_result.metrics)
trainer.save_metrics("train", train_result.metrics)
trainer.save_state()
# Evaluation
if training_args.do_eval:
metrics = trainer.evaluate()
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
# Write model card and (optionally) push to hub
kwargs = {
"finetuned_from": model_args.model_name_or_path,
"tasks": "audio-classification",
"dataset": data_args.dataset_name,
"tags": ["audio-classification"],
}
if training_args.push_to_hub:
trainer.push_to_hub(**kwargs)
else:
trainer.create_model_card(**kwargs)
if __name__ == "__main__":
main()
......@@ -40,12 +40,14 @@ SRC_DIRS = [
"translation",
"image-classification",
"speech-recognition",
"audio-classification",
]
]
sys.path.extend(SRC_DIRS)
if SRC_DIRS is not None:
import run_audio_classification
import run_clm
import run_generation
import run_glue
......@@ -410,3 +412,38 @@ class ExamplesTests(TestCasePlus):
run_speech_recognition_ctc.main()
result = get_results(tmp_dir)
self.assertLess(result["eval_loss"], result["train_loss"])
def test_run_audio_classification(self):
stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler)
tmp_dir = self.get_auto_remove_tmp_dir()
testargs = f"""
run_audio_classification.py
--output_dir {tmp_dir}
--model_name_or_path hf-internal-testing/tiny-random-wav2vec2
--dataset_name anton-l/superb_demo
--dataset_config_name ks
--train_split_name test
--eval_split_name test
--audio_column_name file
--label_column_name label
--do_train
--do_eval
--learning_rate 1e-4
--per_device_train_batch_size 2
--per_device_eval_batch_size 1
--remove_unused_columns False
--overwrite_output_dir True
--num_train_epochs 10
--max_steps 50
--seed 42
""".split()
if is_cuda_and_apex_available():
testargs.append("--fp16")
with patch.object(sys, "argv", testargs):
run_audio_classification.main()
result = get_results(tmp_dir)
self.assertLess(result["eval_loss"], result["train_loss"])
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment