"tests/models/xlm/test_modeling_xlm.py" did not exist on "c40c7e213bdd0479bdca69df0c500004a7294d39"
asr.mdx 9.01 KB
Newer Older
Steven Liu's avatar
Steven Liu committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Automatic speech recognition

<Youtube id="TksaY_FDgnk"/>

Automatic speech recognition (ASR) converts a speech signal to text. It is an example of a sequence-to-sequence task, going from a sequence of audio inputs to textual outputs. Voice assistants like Siri and Alexa utilize ASR models to assist users.
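
To get a quick feel for what ASR produces, you can try an existing fine-tuned checkpoint with the [`pipeline`]. This is only an illustrative sketch and not part of the fine-tuning workflow in this guide; the file name below is a placeholder for any speech recording you have on hand:

```py
>>> from transformers import pipeline

>>> # "sample.wav" is a hypothetical local file; any speech recording works
>>> transcriber = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
>>> transcriber("sample.wav")  # returns a dict like {"text": "..."}
```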

This guide will show you how to fine-tune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset to transcribe audio to text.

<Tip>

See the automatic speech recognition [task page](https://huggingface.co/tasks/automatic-speech-recognition) for more information about its associated models, datasets, and metrics.

</Tip>

## Load MInDS-14 dataset

Load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset from the 🤗 Datasets library:

```py
>>> from datasets import load_dataset, Audio

>>> minds = load_dataset("PolyAI/minds14", name="en-US", split="train")
```

Split this dataset into a train and test set:

```py
>>> minds = minds.train_test_split(test_size=0.2)
```

Then take a look at the dataset:

```py
>>> minds
DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 450
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 113
    })
})
```

While the dataset contains a lot of helpful information, like `lang_id` and `intent_class`, you will focus on the `audio` and `transcription` columns in this guide. Remove the other columns:

```py
>>> minds = minds.remove_columns(["english_transcription", "intent_class", "lang_id"])
```

Take a look at an example:

```py
>>> minds["train"][0]
{'audio': {'array': array([-0.00024414,  0.        ,  0.        , ...,  0.00024414,
          0.00024414,  0.00024414], dtype=float32),
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
  'sampling_rate': 8000},
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
 'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"}
```

The `audio` column contains a 1-dimensional `array` of the speech signal. You need to call the `audio` column to load and resample the audio file.
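
For example, indexing into the `audio` column decodes the file into the array and returns its sampling rate. This is just an illustrative check, not a required step:

```py
>>> sample = minds["train"][0]["audio"]
>>> sample["array"].dtype, sample["sampling_rate"]
(dtype('float32'), 8000)
```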

## Preprocess

Load the Wav2Vec2 processor to process the audio signal and transcribed text:

```py
>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base")
```

The [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset has a sampling rate of 8000Hz. You will need to resample the dataset to 16000Hz to use the pretrained Wav2Vec2 model:

```py
>>> minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
>>> minds["train"][0]
{'audio': {'array': array([-2.38064706e-04, -1.58618059e-04, -5.43987835e-06, ...,
          2.78103951e-04,  2.38446111e-04,  1.18740834e-04], dtype=float32),
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
  'sampling_rate': 16000},
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
 'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"}
```

The preprocessing function needs to:

1. Call the `audio` column to load and resample the audio file.
2. Extract the `input_values` from the audio file.
3. Call the processor again with the `text` argument so the transcription is passed to the tokenizer, which returns the `labels`.

```py
>>> def prepare_dataset(batch):
...     audio = batch["audio"]

...     batch["input_values"] = processor(audio=audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
...     batch["input_length"] = len(batch["input_values"])

...     batch["labels"] = processor(text=batch["transcription"]).input_ids
...     return batch
```

Use 馃 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the map function by increasing the number of processes with `num_proc`. Remove the columns you don't need:

```py
>>> encoded_minds = minds.map(prepare_dataset, remove_columns=minds.column_names["train"], num_proc=4)
```

馃 Transformers doesn't have a data collator for automatic speech recognition, so you will need to create one. You can adapt the [`DataCollatorWithPadding`] to create a batch of examples for automatic speech recognition. It will also dynamically pad your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.

Unlike other data collators, this specific data collator needs to apply a different padding method to `input_values` and `labels`:

```py
>>> import torch

>>> from dataclasses import dataclass, field
>>> from typing import Any, Dict, List, Optional, Union


>>> @dataclass
... class DataCollatorCTCWithPadding:

...     processor: AutoProcessor
...     padding: Union[bool, str] = True

...     def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
...         # split inputs and labels since they have to be of different lengths and need
...         # different padding methods
...         input_features = [{"input_values": feature["input_values"]} for feature in features]
...         label_features = [{"input_ids": feature["labels"]} for feature in features]

...         batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")

...         labels_batch = self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt")

...         # replace padding with -100 to ignore loss correctly
...         labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

...         batch["labels"] = labels

...         return batch
```

Create a batch of examples and dynamically pad them with `DataCollatorCTCWithPadding`:

```py
>>> data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
```
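
If you want to sanity-check the collator, pass it a couple of processed examples and confirm that the audio inputs and labels both come back as tensors padded to the longest element in the batch. This is only an illustrative check and isn't required for training:

```py
>>> features = [encoded_minds["train"][i] for i in range(2)]
>>> batch = data_collator(features)
>>> print(batch["input_values"].shape, batch["labels"].shape)
```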

## Train

<frameworkcontent>
<pt>
Load Wav2Vec2 with [`AutoModelForCTC`]. Specify the reduction to apply with the `ctc_loss_reduction` parameter. It is often better to use the average instead of the default summation:

```py
>>> from transformers import AutoModelForCTC, TrainingArguments, Trainer

>>> model = AutoModelForCTC.from_pretrained(
...     "facebook/wav2vec2-base",
...     ctc_loss_reduction="mean",
...     pad_token_id=processor.tokenizer.pad_token_id,
... )
```

<Tip>

If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!

</Tip>

At this point, only three steps remain:

1. Define your training hyperparameters in [`TrainingArguments`].
2. Pass the training arguments to [`Trainer`] along with the model, datasets, tokenizer, and data collator.
3. Call [`~Trainer.train`] to fine-tune your model.

```py
>>> training_args = TrainingArguments(
...     output_dir="./results",
...     group_by_length=True,
...     per_device_train_batch_size=16,
...     evaluation_strategy="steps",
...     num_train_epochs=3,
...     fp16=True,
...     gradient_checkpointing=True,
...     learning_rate=1e-4,
...     weight_decay=0.005,
...     save_total_limit=2,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=encoded_minds["train"],
...     eval_dataset=encoded_minds["test"],
...     tokenizer=processor.feature_extractor,
...     data_collator=data_collator,
... )

>>> trainer.train()
```
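
After training, you may want to save the model and processor and try the fine-tuned checkpoint on a test example. This is a minimal sketch: the output directory name is only an illustration, and saving the processor keeps the tokenizer and feature extractor next to the model weights so the [`pipeline`] can load everything from one place:

```py
>>> trainer.save_model("./results/wav2vec2-minds14")  # hypothetical output directory
>>> processor.save_pretrained("./results/wav2vec2-minds14")

>>> from transformers import pipeline

>>> transcriber = pipeline("automatic-speech-recognition", model="./results/wav2vec2-minds14")
>>> transcriber(minds["test"][0]["audio"]["path"])
```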
</pt>
</frameworkcontent>

<Tip>

For a more in-depth example of how to fine-tune a model for automatic speech recognition, take a look at this blog [post](https://huggingface.co/blog/fine-tune-wav2vec2-english) for English ASR and this [post](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) for multilingual ASR.

</Tip>