<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Audio classification

<Youtube id="KWwzcmG98Ds"/>

Audio classification assigns a label or class to audio data. It is similar to text classification, except an audio input is continuous and must be discretized, whereas text can be split into tokens. Some practical applications of audio classification include identifying intent, speakers, and even animal species by their sounds.

This guide will show you how to fine-tune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset to classify speaker intent.

<Tip>

See the audio classification [task page](https://huggingface.co/tasks/audio-classification) for more information about its associated models, datasets, and metrics.

</Tip>
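
Before fine-tuning, it can help to see what audio classification looks like at inference time. The snippet below is a quick, optional demo with the [`pipeline`] API; it assumes the [superb/wav2vec2-base-superb-ks](https://huggingface.co/superb/wav2vec2-base-superb-ks) keyword spotting checkpoint, but any audio classification checkpoint from the Hub will work:

```py
>>> from transformers import pipeline

>>> # Assumed checkpoint: a Wav2Vec2 model fine-tuned for keyword spotting
>>> classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-ks")
>>> # Pass a path to an audio file (or a NumPy array sampled at the model's rate)
>>> classifier("path/to/audio.wav")
```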

## Load MInDS-14 dataset

Load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset from the 🤗 Datasets library:

```py
>>> from datasets import load_dataset, Audio

>>> minds = load_dataset("PolyAI/minds14", name="en-US", split="train")
```

Split this dataset into a train and test set:

```py
>>> minds = minds.train_test_split(test_size=0.2)
```

Then take a look at the dataset:

```py
>>> minds
DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 450
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 113
    })
})
```

While the dataset contains a lot of other useful information, like `lang_id` and `english_transcription`, you will focus on the `audio` and `intent_class` columns in this guide. Remove the other columns:

```py
>>> minds = minds.remove_columns(["path", "transcription", "english_transcription", "lang_id"])
```

Take a look at an example now:

```py
>>> minds["train"][0]
{'audio': {'array': array([ 0.        ,  0.        ,  0.        , ..., -0.00048828,
         -0.00024414, -0.00024414], dtype=float32),
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602b9a5fbb1e6d0fbce91f52.wav',
  'sampling_rate': 8000},
 'intent_class': 2}
```

The `audio` column contains a 1-dimensional `array` of the speech signal. Calling the `audio` column automatically loads and resamples the audio file. The `intent_class` column is an integer that represents the class id of the intent. Create a dictionary that maps a label name to an integer and vice versa. The mapping will help the model recover the label name from the label id:

```py
>>> labels = minds["train"].features["intent_class"].names
>>> label2id, id2label = dict(), dict()
>>> for i, label in enumerate(labels):
...     label2id[label] = str(i)
...     id2label[str(i)] = label
```

Now you can convert the label id to a label name:

```py
>>> id2label[str(2)]
'app_error'
```

Each intent class, or label, corresponds to a number; `2` indicates `app_error` in the example above.

## Preprocess

Load the Wav2Vec2 feature extractor to process the audio signal:

```py
>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
```

The [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset has a sampling rate of 8kHz, but Wav2Vec2 was pretrained on speech audio sampled at 16kHz. You will need to resample the dataset to use the pretrained Wav2Vec2 model:

```py
>>> minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
>>> minds["train"][0]
{'audio': {'array': array([ 2.2098757e-05,  4.6582241e-05, -2.2803260e-05, ...,
         -2.8419291e-04, -2.3305941e-04, -1.1425107e-04], dtype=float32),
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602b9a5fbb1e6d0fbce91f52.wav',
  'sampling_rate': 16000},
 'intent_class': 2}
```

The preprocessing function needs to:

1. Call the `audio` column to load, and if necessary, resample the audio file.
2. Check that the sampling rate of the audio file matches the sampling rate of the audio data the model was pretrained with. You can find this information in the Wav2Vec2 [model card](https://huggingface.co/facebook/wav2vec2-base).
3. Set a maximum input length so longer inputs are truncated to a consistent size (here 16000 samples, or one second at 16kHz).

```py
>>> def preprocess_function(examples):
...     audio_arrays = [x["array"] for x in examples["audio"]]
...     inputs = feature_extractor(
...         audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True
...     )
...     return inputs
```
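
Before mapping the function over the whole dataset, you can sanity-check it on a small slice. This is an optional check, not part of the original recipe; note that `max_length=16000` corresponds to one second of audio at the 16kHz sampling rate:

```py
>>> # Run the preprocessing function on the first two examples
>>> batch = preprocess_function(minds["train"][:2])
>>> # Each entry is truncated to at most 16000 samples (one second at 16kHz)
>>> [len(x) for x in batch["input_values"]]
```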

Use 馃 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need, and rename `intent_class` to `label` because that is what the model expects:

```py
>>> encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True)
>>> encoded_minds = encoded_minds.rename_column("intent_class", "label")
```
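
If you'd like to confirm the preprocessing worked, inspect the remaining columns (another quick, optional check):

```py
>>> # The audio column is gone and intent_class has been renamed;
>>> # expect something like ['label', 'input_values']
>>> encoded_minds["train"].column_names
```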

## Train

<frameworkcontent>
<pt>
Load Wav2Vec2 with [`AutoModelForAudioClassification`]. Specify the number of labels, and pass the model the mappings between label id and label name:

```py
>>> from transformers import AutoModelForAudioClassification, TrainingArguments, Trainer

>>> num_labels = len(id2label)
>>> model = AutoModelForAudioClassification.from_pretrained(
...     "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
... )
```
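
As a quick sanity check, the model's configuration should now reflect the 14 MInDS-14 intent classes:

```py
>>> # One output per intent class in MInDS-14
>>> model.config.num_labels
14
```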

<Tip>

If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!

</Tip>

At this point, only three steps remain:

1. Define your training hyperparameters in [`TrainingArguments`].
2. Pass the training arguments to [`Trainer`] along with the model, datasets, and feature extractor.
3. Call [`~Trainer.train`] to fine-tune your model.

```py
>>> training_args = TrainingArguments(
...     output_dir="./results",
...     evaluation_strategy="epoch",
...     save_strategy="epoch",
...     learning_rate=3e-5,
...     num_train_epochs=5,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=encoded_minds["train"],
...     eval_dataset=encoded_minds["test"],
...     tokenizer=feature_extractor,
... )

>>> trainer.train()
```
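
By default, the [`Trainer`] only reports the loss during evaluation. If you'd also like to track accuracy, one option is to pass a `compute_metrics` function when creating the [`Trainer`]; here is a minimal sketch that computes accuracy by hand with NumPy:

```py
>>> import numpy as np

>>> def compute_metrics(eval_pred):
...     # eval_pred.predictions holds the raw logits; pick the highest-scoring class
...     predictions = np.argmax(eval_pred.predictions, axis=1)
...     return {"accuracy": float((predictions == eval_pred.label_ids).mean())}
```

Pass it to the [`Trainer`] above with `compute_metrics=compute_metrics`, and accuracy will be reported at every evaluation.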
</pt>
</frameworkcontent>
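
Once training is done, you can try your fine-tuned model with the [`pipeline`] API. The sketch below assumes you saved the model locally first, for example with `trainer.save_model("./finetuned_wav2vec2_minds14")` (a hypothetical path):

```py
>>> from transformers import pipeline

>>> # Load the fine-tuned model from the hypothetical save directory
>>> classifier = pipeline("audio-classification", model="./finetuned_wav2vec2_minds14")
>>> # The pipeline accepts a file path or a NumPy array sampled at 16kHz
>>> classifier(minds["train"][0]["audio"]["array"])
```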

<Tip>

For a more in-depth example of how to fine-tune a model for audio classification, take a look at the corresponding [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb).

</Tip>