<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Automatic speech recognition

[[open-in-colab]]

<Youtube id="TksaY_FDgnk"/>

Automatic speech recognition (ASR) converts a speech signal to text, mapping a sequence of audio inputs to text outputs. Virtual assistants like Siri and Alexa use ASR models to help users every day, and there are many other useful user-facing applications like live captioning and note-taking during meetings.

This guide will show you how to:

1. Finetune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset to transcribe audio to text.
2. Use your finetuned model for inference.

<Tip>
The task illustrated in this tutorial is supported by the following model architectures:

<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

[Data2VecAudio](../model_doc/data2vec-audio), [Hubert](../model_doc/hubert), [M-CTC-T](../model_doc/mctct), [SEW](../model_doc/sew), [SEW-D](../model_doc/sew-d), [UniSpeech](../model_doc/unispeech), [UniSpeechSat](../model_doc/unispeech-sat), [Wav2Vec2](../model_doc/wav2vec2), [Wav2Vec2-BERT](../model_doc/wav2vec2-bert), [Wav2Vec2-Conformer](../model_doc/wav2vec2-conformer), [WavLM](../model_doc/wavlm)

<!--End of the generated tip-->

</Tip>

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install transformers datasets evaluate jiwer
```

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

```py
>>> from huggingface_hub import notebook_login

>>> notebook_login()
```

## Load MInDS-14 dataset

Start by loading a smaller subset of the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset.

```py
>>> from datasets import load_dataset, Audio

>>> minds = load_dataset("PolyAI/minds14", name="en-US", split="train[:100]")
```

Split the dataset's `train` split into a train and test set with the [`~Dataset.train_test_split`] method:

```py
>>> minds = minds.train_test_split(test_size=0.2)
```

Then take a look at the dataset:

```py
>>> minds
DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 16
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 4
    })
})
```

While the dataset contains a lot of useful information, like `lang_id` and `english_transcription`, you'll focus on the `audio` and `transcription` columns in this guide. Remove the other columns with the [`~datasets.Dataset.remove_columns`] method:

```py
>>> minds = minds.remove_columns(["english_transcription", "intent_class", "lang_id"])
```

Take a look at the example again:

```py
>>> minds["train"][0]
{'audio': {'array': array([-0.00024414,  0.        ,  0.        , ...,  0.00024414,
          0.00024414,  0.00024414], dtype=float32),
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
  'sampling_rate': 8000},
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
 'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"}
```

There are two fields:

- `audio`: a 1-dimensional `array` of the speech signal that must be called to load and resample the audio file (see the quick check after this list).
- `transcription`: the target text.
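
Calling the `audio` column is what loads and decodes the file. A quick check shows the audio is still at the dataset's original 8kHz sampling rate:

```py
>>> # accessing the `audio` column loads and decodes the file on the fly
>>> minds["train"][0]["audio"]["sampling_rate"]
8000
```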

## Preprocess

The next step is to load a Wav2Vec2 processor to process the audio signal:

```py
>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base")
```

The MInDS-14 dataset has a sampling rate of 8kHz (you can find this information in its [dataset card](https://huggingface.co/datasets/PolyAI/minds14)), which means you'll need to resample the dataset to 16kHz to use the pretrained Wav2Vec2 model:

```py
>>> minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
>>> minds["train"][0]
{'audio': {'array': array([-2.38064706e-04, -1.58618059e-04, -5.43987835e-06, ...,
          2.78103951e-04,  2.38446111e-04,  1.18740834e-04], dtype=float32),
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
  'sampling_rate': 16000},
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
 'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"}
```

As you can see in the `transcription` above, the text contains a mix of upper and lowercase characters. The Wav2Vec2 tokenizer is trained only on uppercase characters, so you'll need to make sure the text matches the tokenizer's vocabulary:

```py
>>> def uppercase(example):
...     return {"transcription": example["transcription"].upper()}


>>> minds = minds.map(uppercase)
```
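
As an optional sanity check (the exact entries depend on the checkpoint), you can peek at the tokenizer's vocabulary to confirm it contains uppercase characters plus special tokens:

```py
>>> # expect uppercase letters plus special tokens like the word delimiter
>>> sorted(processor.tokenizer.get_vocab().keys())
```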

Now create a preprocessing function that:

1. Calls the `audio` column to load and resample the audio file.
2. Extracts the `input_values` from the audio file and tokenizes the `transcription` column with the processor.

```py
>>> def prepare_dataset(batch):
...     audio = batch["audio"]
...     batch = processor(audio["array"], sampling_rate=audio["sampling_rate"], text=batch["transcription"])
...     batch["input_length"] = len(batch["input_values"][0])
...     return batch
```

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [`~datasets.Dataset.map`] function. You can speed up `map` by increasing the number of processes with the `num_proc` parameter. Remove the columns you don't need with the [`~datasets.Dataset.remove_columns`] method:

```py
>>> encoded_minds = minds.map(prepare_dataset, remove_columns=minds.column_names["train"], num_proc=4)
```
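
After mapping, only the columns produced by `prepare_dataset` should remain. The names below are what the processor is expected to return, so treat this as an optional sanity check:

```py
>>> # expect the processed audio inputs, their lengths, and the tokenized labels
>>> encoded_minds["train"].column_names
['input_values', 'input_length', 'labels']
```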

🤗 Transformers doesn't have a data collator for ASR, so you'll need to adapt the [`DataCollatorWithPadding`] to create a batch of examples. It'll also dynamically pad your text and labels to the length of the longest element in its batch (instead of the entire dataset) so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.

Unlike other data collators, this specific data collator needs to apply a different padding method to `input_values` and `labels`:

```py
>>> import torch

>>> from dataclasses import dataclass, field
>>> from typing import Any, Dict, List, Optional, Union


>>> @dataclass
... class DataCollatorCTCWithPadding:
...     processor: AutoProcessor
...     padding: Union[bool, str] = "longest"

...     def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
...         # split inputs and labels since they have to be of different lengths and need
...         # different padding methods
...         input_features = [{"input_values": feature["input_values"][0]} for feature in features]
...         label_features = [{"input_ids": feature["labels"]} for feature in features]

...         batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")

...         labels_batch = self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt")

...         # replace padding with -100 to ignore loss correctly
...         labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

...         batch["labels"] = labels

...         return batch
```

Now instantiate your `DataCollatorCTCWithPadding`:

```py
>>> data_collator = DataCollatorCTCWithPadding(processor=processor, padding="longest")
```
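
To see the dynamic padding in action, you can optionally collate a couple of processed examples yourself (a minimal sketch; the exact shapes depend on the audio lengths in your split):

```py
>>> # both tensors are padded to the longest element in this two-example batch;
>>> # padded label positions are -100 so the loss ignores them
>>> batch = data_collator([encoded_minds["train"][i] for i in range(2)])
>>> batch["input_values"].shape, batch["labels"].shape
```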

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [word error rate](https://huggingface.co/spaces/evaluate-metric/wer) (WER) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

```py
>>> import evaluate

>>> wer = evaluate.load("wer")
```
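
WER is the word-level edit distance (substitutions, insertions, and deletions) divided by the number of words in the reference, so lower is better. For example:

```py
>>> # 2 edits (1 substitution + 1 deletion) over 9 reference words = 2/9
>>> wer.compute(
...     predictions=["I WOULD LIKE TO SET UP AN ACCOUNT"],
...     references=["I WOULD LIKE TO SET UP A JOINT ACCOUNT"],
... )
0.2222222222222222
```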

Then create a function that passes your predictions and labels to [`~evaluate.EvaluationModule.compute`] to calculate the WER:

```py
>>> import numpy as np


>>> def compute_metrics(pred):
...     pred_logits = pred.predictions
...     pred_ids = np.argmax(pred_logits, axis=-1)

...     pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

...     pred_str = processor.batch_decode(pred_ids)
...     label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

...     wer_score = wer.compute(predictions=pred_str, references=label_str)

...     return {"wer": wer_score}
```

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.

## Train

<frameworkcontent>
<pt>
<Tip>

If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! Load Wav2Vec2 with [`AutoModelForCTC`]. Specify the reduction to apply with the `ctc_loss_reduction` parameter. It is often better to use the average instead of the default summation:

```py
>>> from transformers import AutoModelForCTC, TrainingArguments, Trainer

>>> model = AutoModelForCTC.from_pretrained(
...     "facebook/wav2vec2-base",
...     ctc_loss_reduction="mean",
...     pad_token_id=processor.tokenizer.pad_token_id,
... )
```
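
You can confirm the reduction was applied by inspecting the model config:

```py
>>> model.config.ctc_loss_reduction
'mean'
```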

At this point, only three steps remain:

1. Define your training hyperparameters in [`TrainingArguments`]. The only required parameter is `output_dir` which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [`Trainer`] will evaluate the WER and save the training checkpoint.
2. Pass the training arguments to [`Trainer`] along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [`~Trainer.train`] to finetune your model.

```py
>>> training_args = TrainingArguments(
...     output_dir="my_awesome_asr_mind_model",
...     per_device_train_batch_size=8,
...     gradient_accumulation_steps=2,
...     learning_rate=1e-5,
...     warmup_steps=500,
...     max_steps=2000,
...     gradient_checkpointing=True,
...     fp16=True,
...     group_by_length=True,
...     eval_strategy="steps",
...     per_device_eval_batch_size=8,
...     save_steps=1000,
...     eval_steps=1000,
...     logging_steps=25,
...     load_best_model_at_end=True,
...     metric_for_best_model="wer",
...     greater_is_better=False,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=encoded_minds["train"],
...     eval_dataset=encoded_minds["test"],
...     tokenizer=processor,
...     data_collator=data_collator,
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()
```

Once training is completed, share your model on the Hub with the [`~transformers.Trainer.push_to_hub`] method so everyone can use your model:

```py
>>> trainer.push_to_hub()
```
</pt>
</frameworkcontent>

<Tip>

For a more in-depth example of how to finetune a model for automatic speech recognition, take a look at this blog [post](https://huggingface.co/blog/fine-tune-wav2vec2-english) for English ASR and this [post](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) for multilingual ASR.

</Tip>

## Inference

Great, now that you've finetuned a model, you can use it for inference!

Load an audio file you'd like to run inference on. Remember to resample the audio file so its sampling rate matches the model's sampling rate if you need to!

```py
>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> audio_file = dataset[0]["audio"]["path"]
```

The simplest way to try out your finetuned model for inference is to use it in a [`pipeline`]. Instantiate a `pipeline` for automatic speech recognition with your model, and pass your audio file to it:

```py
>>> from transformers import pipeline

>>> transcriber = pipeline("automatic-speech-recognition", model="stevhliu/my_awesome_asr_minds_model")
>>> transcriber(audio_file)
{'text': 'I WOUD LIKE O SET UP JOINT ACOUNT WTH Y PARTNER'}
```

<Tip>

The transcription is decent, but it could be better! Try finetuning your model on more examples to get even better results!

</Tip>

You can also manually replicate the results of the `pipeline` if you'd like:

<frameworkcontent>
<pt>
Load a processor to preprocess the audio file and transcription and return the inputs as PyTorch tensors:

```py
>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("stevhliu/my_awesome_asr_mind_model")
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
```

Pass your inputs to the model and return the logits:

```py
>>> import torch
>>> from transformers import AutoModelForCTC

>>> model = AutoModelForCTC.from_pretrained("stevhliu/my_awesome_asr_mind_model")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
```

Get the predicted `input_ids` with the highest probability, and use the processor to decode the predicted `input_ids` back into text:

```py
>>> predicted_ids = torch.argmax(logits, dim=-1)
>>> transcription = processor.batch_decode(predicted_ids)
>>> transcription
['I WOUL LIKE O SET UP JOINT ACOUNT WTH Y PARTNER']
```
</pt>
</frameworkcontent>