Unverified Commit 7c5d7991 authored by Steven Liu, committed by GitHub

Update audio examples with MInDS-14 (#16633)

* ✨ update audio examples with minds dataset

* 🖍 make style

* 🖍 minor fixes for doctests
parent 4d461067
@@ -199,22 +199,22 @@ Audio inputs are preprocessed differently than textual inputs, but the end goal
pip install datasets
```
Load the keyword spotting task from the [SUPERB](https://huggingface.co/datasets/superb) benchmark (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset):
Load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset):
```py
>>> from datasets import load_dataset, Audio
>>> dataset = load_dataset("superb", "ks")
>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
```
Access the first element of the `audio` column to take a look at the input. Calling the `audio` column will automatically load and resample the audio file:
```py
>>> dataset["train"][0]["audio"]
{'array': array([ 0. , 0. , 0. , ..., -0.00592041,
-0.00405884, -0.00253296], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/05734a36d88019a09725c20cc024e1c4e7982e37d7d55c0c1ca1742ea1cdd47f/_background_noise_/doing_the_dishes.wav',
'sampling_rate': 16000}
>>> dataset[0]["audio"]
{'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414,
0. , 0. ], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
'sampling_rate': 8000}
```
This returns three items: the `array` of the speech signal, the `path` to the audio file, and its `sampling_rate`.
@@ -227,34 +227,34 @@ This returns three items:
For this tutorial, you will use the [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) model. As you can see from the model card, the Wav2Vec2 model is pretrained on 16kHz sampled speech audio. It is important your audio data's sampling rate matches the sampling rate of the dataset used to pretrain the model. If your data's sampling rate isn't the same, then you need to resample your audio data.
For example, load the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset, which has a sampling rate of 22050Hz. In order to use the Wav2Vec2 model with this dataset, downsample the sampling rate to 16kHz:
For example, the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset has a sampling rate of 8kHz. In order to use the Wav2Vec2 model with this dataset, upsample the sampling rate to 16kHz:
```py
>>> lj_speech = load_dataset("lj_speech", split="train")
>>> lj_speech[0]["audio"]
{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ...,
7.3242188e-04, 2.1362305e-04, 6.1035156e-05], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
'sampling_rate': 22050}
>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
>>> dataset[0]["audio"]
{'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414,
0. , 0. ], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
'sampling_rate': 8000}
```
1. Use πŸ€— Datasets' [`cast_column`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.cast_column) method to downsample the sampling rate to 16kHz:
1. Use πŸ€— Datasets' [`cast_column`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.cast_column) method to upsample the sampling rate to 16kHz:
```py
>>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000))
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
```
2. Load the audio file:
```py
>>> lj_speech[0]["audio"]
{'array': array([-0.00064146, -0.00074657, -0.00068768, ..., 0.00068341,
0.00014045, 0. ], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
>>> dataset[0]["audio"]
{'array': array([ 2.3443763e-05, 2.1729663e-04, 2.2145823e-04, ...,
3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
'sampling_rate': 16000}
```
As you can see, the `sampling_rate` was downsampled to 16kHz. Now that you know how resampling works, let's return to our previous example with the SUPERB dataset!
As you can see, the `sampling_rate` is now 16kHz!
### Feature extractor
@@ -271,9 +271,10 @@ Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:
Pass the audio `array` to the feature extractor. We also recommend passing the `sampling_rate` argument to the feature extractor in order to better debug any silent errors that may occur.
```py
>>> audio_input = [dataset["train"][0]["audio"]["array"]]
>>> audio_input = [dataset[0]["audio"]["array"]]
>>> feature_extractor(audio_input, sampling_rate=16000)
{'input_values': [array([ 0.00045439, 0.00045439, 0.00045439, ..., -0.1578519 , -0.10807519, -0.06727459], dtype=float32)]}
{'input_values': [array([ 3.8106556e-04, 2.7506407e-03, 2.8015103e-03, ...,
5.6335266e-04, 4.6588284e-06, -1.7142107e-04], dtype=float32)]}
```
### Pad and truncate
@@ -281,11 +282,11 @@ Pass the audio `array` to the feature extractor. We also recommend adding the `s
Just like the tokenizer, you can apply padding or truncation to handle variable sequences in a batch. Take a look at the sequence length of these two audio samples:
```py
>>> dataset["train"][0]["audio"]["array"].shape
(1522930,)
>>> dataset[0]["audio"]["array"].shape
(173398,)
>>> dataset["train"][1]["audio"]["array"].shape
(988891,)
>>> dataset[1]["audio"]["array"].shape
(106496,)
```
As you can see, the first sample has a longer sequence than the second sample. Let's create a function that will preprocess the dataset. Specify a maximum sample length, and the feature extractor will either pad or truncate the sequences to match it:
@@ -297,7 +298,7 @@ As you can see, the first sample has a longer sequence than the second sample. L
... audio_arrays,
... sampling_rate=16000,
... padding=True,
... max_length=1000000,
... max_length=100000,
... truncation=True,
... )
... return inputs
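The hunk above only shows the changed `max_length` value. For context, the full preprocessing function in this guide follows this pattern (a sketch; the signature and `audio_arrays` lines sit outside the hunk):

```py
>>> def preprocess_function(examples):
...     audio_arrays = [x["array"] for x in examples["audio"]]
...     inputs = feature_extractor(
...         audio_arrays,
...         sampling_rate=16000,
...         padding=True,
...         max_length=100000,
...         truncation=True,
...     )
...     return inputs
```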
@@ -306,17 +307,17 @@ As you can see, the first sample has a longer sequence than the second sample. L
Apply the function to the first few examples in the dataset:
```py
>>> processed_dataset = preprocess_function(dataset["train"][:5])
>>> processed_dataset = preprocess_function(dataset[:5])
```
Now take another look at the processed sample lengths:
```py
>>> processed_dataset["input_values"][0].shape
(1000000,)
(100000,)
>>> processed_dataset["input_values"][1].shape
(1000000,)
(100000,)
```
The lengths of the first two samples now match the maximum length you specified.
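To preprocess the full dataset rather than a slice, apply the same function with the 🤗 Datasets `map` method (a sketch; `batched=True` lets the function receive batches of examples at once):

```py
>>> processed_dataset = dataset.map(preprocess_function, batched=True)
```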
@@ -118,9 +118,9 @@ Create a [`pipeline`] with the task you want to solve for and the model you want
Next, load a dataset you'd like to iterate over (see the 🤗 Datasets [Quick Start](https://huggingface.co/docs/datasets/quickstart.html) for more details). For example, let's load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset:
```py
>>> import datasets
>>> from datasets import load_dataset
>>> dataset = datasets.load_dataset("PolyAI/minds14", name="en-US", split="train") # doctest: +IGNORE_RESULT
>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train") # doctest: +IGNORE_RESULT
```
You can pass a whole dataset to the pipeline:
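For example, assuming an automatic speech recognition pipeline named `speech_recognizer` was created in the previous step (the name and model here are illustrative), you can feed it the audio file paths from the dataset:

```py
>>> # hypothetical pipeline from the step above, e.g.:
>>> # speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
>>> result = speech_recognizer(dataset[:4]["path"])
>>> print([prediction["text"] for prediction in result])
```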
@@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License.
Automatic speech recognition (ASR) converts a speech signal to text. It is an example of a sequence-to-sequence task, going from a sequence of audio inputs to textual outputs. Voice assistants like Siri and Alexa utilize ASR models to assist users.
This guide will show you how to fine-tune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the [TIMIT](https://huggingface.co/datasets/timit_asr) dataset to transcribe audio to text.
This guide will show you how to fine-tune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset to transcribe audio to text.
<Tip>
@@ -24,50 +24,54 @@ See the automatic speech recognition [task page](https://huggingface.co/tasks/au
</Tip>
## Load TIMIT dataset
## Load MInDS-14 dataset
Load the TIMIT dataset from the 🤗 Datasets library:
Load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset from the 🤗 Datasets library:
```py
>>> from datasets import load_dataset
>>> from datasets import load_dataset, Audio
>>> timit = load_dataset("timit_asr")
>>> minds = load_dataset("PolyAI/minds14", name="en-US", split="train")
```
Then take a look at an example:
Split this dataset into a train and test set:
```py
>>> timit
>>> minds = minds.train_test_split(test_size=0.2)
```
Then take a look at the dataset:
```py
>>> minds
DatasetDict({
train: Dataset({
features: ['file', 'audio', 'text', 'phonetic_detail', 'word_detail', 'dialect_region', 'sentence_type', 'speaker_id', 'id'],
num_rows: 4620
features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
num_rows: 450
})
test: Dataset({
features: ['file', 'audio', 'text', 'phonetic_detail', 'word_detail', 'dialect_region', 'sentence_type', 'speaker_id', 'id'],
num_rows: 1680
features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
num_rows: 113
})
})
```
While the dataset contains a lot of helpful information, like `dialect_region` and `sentence_type`, you will focus on the `audio` and `text` fields in this guide. Remove the other columns:
While the dataset contains a lot of helpful information, like `lang_id` and `intent_class`, you will focus on the `audio` and `transcription` columns in this guide. Remove the other columns:
```py
>>> timit = timit.remove_columns(
... ["phonetic_detail", "word_detail", "dialect_region", "id", "sentence_type", "speaker_id"]
... )
>>> minds = minds.remove_columns(["english_transcription", "intent_class", "lang_id"])
```
Take a look at the example again:
```py
>>> timit["train"][0]
{'audio': {'array': array([-2.1362305e-04, 6.1035156e-05, 3.0517578e-05, ...,
-3.0517578e-05, -9.1552734e-05, -6.1035156e-05], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/404950a46da14eac65eb4e2a8317b1372fb3971d980d91d5d5b221275b1fd7e0/data/TRAIN/DR4/MMDM0/SI681.WAV',
'sampling_rate': 16000},
'file': '/root/.cache/huggingface/datasets/downloads/extracted/404950a46da14eac65eb4e2a8317b1372fb3971d980d91d5d5b221275b1fd7e0/data/TRAIN/DR4/MMDM0/SI681.WAV',
'text': 'Would such an act of refusal be useful?'}
>>> minds["train"][0]
{'audio': {'array': array([-0.00024414, 0. , 0. , ..., 0.00024414,
0.00024414, 0.00024414], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
'sampling_rate': 8000},
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"}
```
The `audio` column contains a 1-dimensional `array` of the speech signal that must be called to load and resample the audio file.
@@ -82,6 +86,19 @@ Load the Wav2Vec2 processor to process the audio signal and transcribed text:
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base")
```
The [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset has a sampling rate of 8kHz. You will need to resample the dataset to 16kHz to use the pretrained Wav2Vec2 model:
```py
>>> minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
>>> minds["train"][0]
{'audio': {'array': array([-2.38064706e-04, -1.58618059e-04, -5.43987835e-06, ...,
2.78103951e-04, 2.38446111e-04, 1.18740834e-04], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
'sampling_rate': 16000},
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"}
```
The preprocessing function needs to:
1. Call the `audio` column to load and resample the audio file.
@@ -96,14 +113,14 @@ The preprocessing function needs to:
... batch["input_length"] = len(batch["input_values"])
... with processor.as_target_processor():
... batch["labels"] = processor(batch["text"]).input_ids
... batch["labels"] = processor(batch["transcription"]).input_ids
... return batch
```
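Only the `transcription` line changes in this hunk. The complete function in the guide follows this pattern (a sketch; the first few lines sit outside the hunk):

```py
>>> def prepare_dataset(batch):
...     audio = batch["audio"]
...     batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
...     batch["input_length"] = len(batch["input_values"])
...     with processor.as_target_processor():
...         batch["labels"] = processor(batch["transcription"]).input_ids
...     return batch
```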
Use the 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by increasing the number of processes with `num_proc`. Remove the columns you don't need:
```py
>>> timit = timit.map(prepare_dataset, remove_columns=timit.column_names["train"], num_proc=4)
>>> encoded_minds = minds.map(prepare_dataset, remove_columns=minds.column_names["train"], num_proc=4)
```
🤗 Transformers doesn't have a data collator for automatic speech recognition, so you will need to create one. You can adapt the [`DataCollatorWithPadding`] to create a batch of examples for automatic speech recognition. It will also dynamically pad your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.
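A minimal sketch of such a collator, here called `DataCollatorCTCWithPadding` (the name is illustrative), assuming the `processor` loaded above. Inputs and labels are padded separately because they have different lengths, and padded label positions are set to `-100` so the CTC loss ignores them:

```py
>>> import torch
>>> from dataclasses import dataclass
>>> from typing import Dict, List, Union

>>> from transformers import AutoProcessor  # already imported earlier in the guide


>>> @dataclass
... class DataCollatorCTCWithPadding:
...     processor: AutoProcessor
...     padding: Union[bool, str] = True
...
...     def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
...         # split inputs and labels since they need different padding methods
...         input_features = [{"input_values": feature["input_values"]} for feature in features]
...         label_features = [{"input_ids": feature["labels"]} for feature in features]
...         batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")
...         with self.processor.as_target_processor():
...             labels_batch = self.processor.pad(label_features, padding=self.padding, return_tensors="pt")
...         # replace padding with -100 so padded positions are ignored by the CTC loss
...         labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
...         batch["labels"] = labels
...         return batch


>>> data_collator = DataCollatorCTCWithPadding(processor=processor)
```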
@@ -165,7 +182,7 @@ Load Wav2Vec2 with [`AutoModelForCTC`]. For `ctc_loss_reduction`, it is often be
>>> from transformers import AutoModelForCTC, TrainingArguments, Trainer
>>> model = AutoModelForCTC.from_pretrained(
... "facebook/wav2vec-base",
... "facebook/wav2vec2-base",
... ctc_loss_reduction="mean",
... pad_token_id=processor.tokenizer.pad_token_id,
... )
@@ -200,8 +217,8 @@ At this point, only three steps remain:
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=timit["train"],
... eval_dataset=timit["test"],
... train_dataset=encoded_minds["train"],
... eval_dataset=encoded_minds["test"],
... tokenizer=processor.feature_extractor,
... data_collator=data_collator,
... )
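Then call `train` to fine-tune the model:

```py
>>> trainer.train()
```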
@@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License.
Audio classification assigns a label or class to audio data. It is similar to text classification, except an audio input is continuous and must be discretized, whereas text can be split into tokens. Some practical applications of audio classification include identifying intent, speakers, and even animal species by their sounds.
This guide will show you how to fine-tune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the Keyword Spotting subset of the [SUPERB](https://huggingface.co/datasets/superb) benchmark to classify utterances.
This guide will show you how to fine-tune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset to classify intent.
<Tip>
@@ -24,27 +24,59 @@ See the audio classification [task page](https://huggingface.co/tasks/audio-clas
</Tip>
## Load SUPERB dataset
## Load MInDS-14 dataset
Load the SUPERB dataset from the 🤗 Datasets library:
Load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset from the 🤗 Datasets library:
```py
>>> from datasets import load_dataset
>>> from datasets import load_dataset, Audio
>>> ks = load_dataset("superb", "ks")
>>> minds = load_dataset("PolyAI/minds14", name="en-US", split="train")
```
Then take a look at an example:
Split this dataset into a train and test set:
```py
>>> ks["train"][0]
{'audio': {'array': array([ 0. , 0. , 0. , ..., -0.00592041, -0.00405884, -0.00253296], dtype=float32), 'path': '/root/.cache/huggingface/datasets/downloads/extracted/05734a36d88019a09725c20cc024e1c4e7982e37d7d55c0c1ca1742ea1cdd47f/_background_noise_/doing_the_dishes.wav', 'sampling_rate': 16000}, 'file': '/root/.cache/huggingface/datasets/downloads/extracted/05734a36d88019a09725c20cc024e1c4e7982e37d7d55c0c1ca1742ea1cdd47f/_background_noise_/doing_the_dishes.wav', 'label': 10}
>>> minds = minds.train_test_split(test_size=0.2)
```
The `audio` column contains a 1-dimensional `array` of the speech signal that must be called to load and resample the audio file. The `label` column is an integer that represents the utterance class. Create a dictionary that maps a label name to an integer and vice versa. The mapping will help the model recover the label name from the label number:
Then take a look at the dataset:
```py
>>> labels = ks["train"].features["label"].names
>>> minds
DatasetDict({
train: Dataset({
features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
num_rows: 450
})
test: Dataset({
features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
num_rows: 113
})
})
```
While the dataset contains a lot of other useful information, like `lang_id` and `english_transcription`, you will focus on the `audio` and `intent_class` columns in this guide. Remove the other columns:
```py
>>> minds = minds.remove_columns(["path", "transcription", "english_transcription", "lang_id"])
```
Take a look at an example now:
```py
>>> minds["train"][0]
{'audio': {'array': array([ 0. , 0. , 0. , ..., -0.00048828,
-0.00024414, -0.00024414], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602b9a5fbb1e6d0fbce91f52.wav',
'sampling_rate': 8000},
'intent_class': 2}
```
The `audio` column contains a 1-dimensional `array` of the speech signal that must be called to load and resample the audio file. The `intent_class` column is an integer that represents the intent class id. Create a dictionary that maps a label name to an integer and vice versa. The mapping will help the model recover the label name from the label number:
```py
>>> labels = minds["train"].features["intent_class"].names
>>> label2id, id2label = dict(), dict()
>>> for i, label in enumerate(labels):
... label2id[label] = str(i)
@@ -54,11 +86,11 @@ The `audio` column contains a 1-dimensional `array` of the speech signal that mu
Now you can convert the label number to a label name for more information:
```py
>>> id2label[str(10)]
'_silence_'
>>> id2label[str(2)]
'app_error'
```
Each keyword - or label - corresponds to a number; `10` indicates `silence` in the example above.
Each intent class - or label - corresponds to a number; `2` indicates `app_error` in the example above.
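These mappings come into play again when the model is loaded for fine-tuning, so it can report label names instead of raw ids (a sketch, assuming [`AutoModelForAudioClassification`] and passing the mappings through to the config):

```py
>>> from transformers import AutoModelForAudioClassification

>>> num_labels = len(id2label)
>>> model = AutoModelForAudioClassification.from_pretrained(
...     "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
... )
```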
## Preprocess
@@ -70,6 +102,18 @@ Load the Wav2Vec2 feature extractor to process the audio signal:
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
```
The [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset has a sampling rate of 8kHz. You will need to resample the dataset to 16kHz to use the pretrained Wav2Vec2 model:
```py
>>> minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
>>> minds["train"][0]
{'audio': {'array': array([ 2.2098757e-05, 4.6582241e-05, -2.2803260e-05, ...,
-2.8419291e-04, -2.3305941e-04, -1.1425107e-04], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602b9a5fbb1e6d0fbce91f52.wav',
'sampling_rate': 16000},
'intent_class': 2}
```
The preprocessing function needs to:
1. Call the `audio` column to load and, if necessary, resample the audio file.
@@ -85,10 +129,11 @@ The preprocessing function needs to:
... return inputs
```
Use the 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need:
Use the 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need, and rename `intent_class` to `label` because that is what the model expects:
```py
>>> encoded_ks = ks.map(preprocess_function, remove_columns=["audio", "file"], batched=True)
>>> encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True)
>>> encoded_minds = encoded_minds.rename_column("intent_class", "label")
```
## Train
@@ -130,8 +175,8 @@ At this point, only three steps remain:
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=encoded_ks["train"],
... eval_dataset=encoded_ks["validation"],
... train_dataset=encoded_minds["train"],
... eval_dataset=encoded_minds["test"],
... tokenizer=feature_extractor,
... )
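Then call `train` to fine-tune the model, just as in the ASR guide:

```py
>>> trainer.train()
```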