"vscode:/vscode.git/clone" did not exist on "4def2fe9696b8cddda0cbeedfc16076d2b2167da"
preprocessing.mdx 21.6 KB
Newer Older
Steven Liu's avatar
Steven Liu committed
1
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Preprocess

[[open-in-colab]]

Before you can use your data in a model, the data needs to be processed into an acceptable format for the model. A model does not understand raw text, images or audio. These inputs need to be converted into numbers and assembled into tensors. In this tutorial, you will:

* Preprocess textual data with a tokenizer.
* Preprocess image or audio data with a feature extractor.
* Preprocess data for a multimodal task with a processor.

## NLP

<Youtube id="Yffk5aydLzg"/>

The main tool for processing textual data is a [tokenizer](main_classes/tokenizer). A tokenizer starts by splitting text into *tokens* according to a set of rules. The tokens are converted into numbers, which are used to build tensors as input to a model. Any additional inputs required by a model are also added by the tokenizer.

<Tip>

If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer. This ensures the text is split the same way as the pretraining corpus, and uses the same corresponding tokens-to-index mapping (usually referred to as the *vocab*) as during pretraining.

</Tip>

Get started quickly by loading a pretrained tokenizer with the [`AutoTokenizer`] class. This downloads the *vocab* used when a model is pretrained.

### Tokenize

Load a pretrained tokenizer with [`AutoTokenizer.from_pretrained`]:

```py
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```

Then pass your sentence to the tokenizer:

```py
>>> encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
>>> print(encoded_input)
{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102], 
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```

The tokenizer returns a dictionary with three important items:

* [input_ids](glossary#input-ids) are the indices corresponding to each token in the sentence.
* [attention_mask](glossary#attention-mask) indicates whether a token should be attended to or not.
* [token_type_ids](glossary#token-type-ids) identifies which sequence a token belongs to when there is more than one sequence.
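
For example, to see which token each id stands for, you can convert the `input_ids` back into token strings with `convert_ids_to_tokens` - a minimal sketch reusing the `tokenizer` and `encoded_input` from above (the exact tokens depend on the tokenizer's vocab):

```py
>>> tokens = tokenizer.convert_ids_to_tokens(encoded_input["input_ids"])
>>> len(tokens) == len(encoded_input["input_ids"])
True
```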

You can decode the `input_ids` to return the original input:

```py
>>> tokenizer.decode(encoded_input["input_ids"])
'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'
```

As you can see, the tokenizer added two special tokens - `CLS` and `SEP` (classifier and separator) - to the sentence. Not all models need
special tokens, but if they do, the tokenizer will automatically add them for you.
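
If you ever need the encoding without them - for example, for a model that wasn't trained with special tokens - you can turn them off. Here is a minimal sketch using the tokenizer's `add_special_tokens` argument (the decoded output follows from the example above, though exact spacing can vary by tokenizer):

```py
>>> encoded_no_special = tokenizer(
...     "Do not meddle in the affairs of wizards, for they are subtle and quick to anger.", add_special_tokens=False
... )
>>> tokenizer.decode(encoded_no_special["input_ids"])
'Do not meddle in the affairs of wizards, for they are subtle and quick to anger.'
```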

If there are several sentences you want to process, pass the sentences as a list to the tokenizer:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_inputs = tokenizer(batch_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1]]}
```

### Pad

This brings us to an important topic. When you process a batch of sentences, they aren't always the same length. This is a problem because tensors, the input to the model, need to have a uniform shape. Padding is a strategy for ensuring tensors are rectangular by adding a special *padding token* to sentences with fewer tokens.

Set the `padding` parameter to `True` to pad the shorter sequences in the batch to match the longest sequence:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
```

Notice the tokenizer padded the first and third sentences with a `0` because they are shorter!

### Truncation

On the other end of the spectrum, sometimes a sequence may be too long for a model to handle. In this case, you will need to truncate the sequence to a shorter length.

Set the `truncation` parameter to `True` to truncate a sequence to the maximum length accepted by the model:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
```
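
The example above doesn't actually cut anything because all three sentences already fit within the model's maximum length. To see truncation (and fixed-length padding) in action, you can pass an explicit `max_length` - a minimal sketch, where `8` is just an illustrative value:

```py
>>> encoded_input = tokenizer(batch_sentences, padding="max_length", max_length=8, truncation=True)
>>> [len(ids) for ids in encoded_input["input_ids"]]
[8, 8, 8]
```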

### Build tensors

Finally, you want the tokenizer to return the actual tensors that are fed to the model.

Set the `return_tensors` parameter to either `pt` for PyTorch, or `tf` for TensorFlow:

<frameworkcontent>
<pt>

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
>>> print(encoded_input)
{'input_ids': tensor([[  101,   153,  7719, 21490,  1122,  1114,  9582,  1623,   102],
                      [  101,  5226,  1122,  9649,  1199,  2610,  1236,   102,     0]]), 
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 1, 1, 1, 0]])}
```
</pt>
<tf>
```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
>>> print(encoded_input)
{'input_ids': <tf.Tensor: shape=(2, 9), dtype=int32, numpy=
array([[  101,   153,  7719, 21490,  1122,  1114,  9582,  1623,   102],
       [  101,  5226,  1122,  9649,  1199,  2610,  1236,   102,     0]],
      dtype=int32)>, 
 'token_type_ids': <tf.Tensor: shape=(2, 9), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>, 
 'attention_mask': <tf.Tensor: shape=(2, 9), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 0]], dtype=int32)>}
```
</tf>
</frameworkcontent>

## Audio

Audio inputs are preprocessed differently than textual inputs, but the end goal remains the same: create numerical sequences the model can understand. A [feature extractor](main_classes/feature_extractor) is designed for the express purpose of extracting features from raw image or audio data and converting them into tensors. Before you begin, install 🤗 Datasets to load an audio dataset to experiment with:

```bash
pip install datasets
```

Load the keyword spotting task from the [SUPERB](https://huggingface.co/datasets/superb) benchmark (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset):

```py
>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("superb", "ks")
```

Access the first element of the `audio` column to take a look at the input. Calling the `audio` column will automatically load and resample the audio file:

```py
>>> dataset["train"][0]["audio"]
{'array': array([ 0.        ,  0.        ,  0.        , ..., -0.00592041,
        -0.00405884, -0.00253296], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/05734a36d88019a09725c20cc024e1c4e7982e37d7d55c0c1ca1742ea1cdd47f/_background_noise_/doing_the_dishes.wav',
 'sampling_rate': 16000}
```

This returns three items:

* `array` is the speech signal loaded - and potentially resampled - as a 1D array.
* `path` points to the location of the audio file.
* `sampling_rate` refers to how many data points in the speech signal are measured per second.

### Resample

For this tutorial, you will use the [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) model. As you can see from the model card, the Wav2Vec2 model is pretrained on 16kHz sampled speech audio. It is important your audio data's sampling rate matches the sampling rate of the dataset used to pretrain the model. If your data's sampling rate isn't the same, then you need to resample your audio data. 

For example, load the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset, which has a sampling rate of 22050Hz. In order to use the Wav2Vec2 model with this dataset, downsample the sampling rate to 16kHz:

```py
>>> lj_speech = load_dataset("lj_speech", split="train")
>>> lj_speech[0]["audio"]
{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ...,
         7.3242188e-04,  2.1362305e-04,  6.1035156e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
 'sampling_rate': 22050}
```

1. Use 馃 Datasets' [`cast_column`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.cast_column) method to downsample the sampling rate to 16kHz:

```py
>>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000))
```

2. Load the audio file:

```py
>>> lj_speech[0]["audio"]
{'array': array([-0.00064146, -0.00074657, -0.00068768, ...,  0.00068341,
         0.00014045,  0.        ], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
 'sampling_rate': 16000}
```

As you can see, the `sampling_rate` was downsampled to 16kHz. Now that you know how resampling works, let's return to our previous example with the SUPERB dataset!

### Feature extractor

The next step is to load a feature extractor to normalize and pad the input. When padding textual data, a `0` is added for shorter sequences. The same idea applies to audio data, and the audio feature extractor will add a `0` - interpreted as silence - to `array`.

Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:

```py
>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
```

Pass the audio `array` to the feature extractor. We also recommend adding the `sampling_rate` argument to the feature extractor call in order to better debug any silent errors that may occur.

```py
>>> audio_input = [dataset["train"][0]["audio"]["array"]]
>>> feature_extractor(audio_input, sampling_rate=16000)
{'input_values': [array([ 0.00045439,  0.00045439,  0.00045439, ..., -0.1578519 , -0.10807519, -0.06727459], dtype=float32)]}
```

### Pad and truncate

Just like the tokenizer, you can apply padding or truncation to handle variable sequences in a batch. Take a look at the sequence length of these two audio samples:

```py
>>> dataset["train"][0]["audio"]["array"].shape
(1522930,)

>>> dataset["train"][1]["audio"]["array"].shape
(988891,)
```

As you can see, the first sample has a longer sequence than the second sample. Let's create a function that will preprocess the dataset. Specify a maximum sample length, and the feature extractor will either pad or truncate the sequences to match it:

```py
>>> def preprocess_function(examples):
...     audio_arrays = [x["array"] for x in examples["audio"]]
...     inputs = feature_extractor(
...         audio_arrays,
...         sampling_rate=16000,
...         padding=True,
...         max_length=1000000,
...         truncation=True,
...     )
...     return inputs
```

Apply the function to the first few examples in the dataset:

```py
>>> processed_dataset = preprocess_function(dataset["train"][:5])
```

Now take another look at the processed sample lengths:

```py
>>> processed_dataset["input_values"][0].shape
(1000000,)

>>> processed_dataset["input_values"][1].shape
(1000000,)
```

The lengths of the first two samples now match the maximum length you specified.
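
To preprocess the whole dataset rather than a slice, you would typically apply the same function with 🤗 Datasets' [`map`](https://huggingface.co/docs/datasets/process.html#map) - a minimal sketch, assuming you want to keep the resulting `input_values` as a new column (`batched=True` processes several examples at once):

```py
>>> dataset = dataset.map(preprocess_function, batched=True)
```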

## Vision

A feature extractor is also used to process images for vision tasks. Once again, the goal is to convert the raw image into a batch of tensors as input.

Let's load the [food101](https://huggingface.co/datasets/food101) dataset for this tutorial. Use 🤗 Datasets `split` parameter to only load a small sample from the training split since the dataset is quite large:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("food101", split="train[:100]")
```

Next, take a look at the image with 馃 Datasets [`Image`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=image#datasets.Image) feature:

```py
>>> dataset[0]["image"]
```

![vision-preprocess-tutorial.png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vision-preprocess-tutorial.png)

### Feature extractor

Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:

```py
>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
```
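
If you just want the model-ready tensors without any augmentation, you can also call the feature extractor directly on one or more images - a minimal sketch using the `dataset` and `feature_extractor` loaded above (this checkpoint resizes images to 224x224):

```py
>>> image_inputs = feature_extractor(dataset[0]["image"], return_tensors="pt")
>>> image_inputs["pixel_values"].shape
torch.Size([1, 3, 224, 224])
```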

### Data augmentation

For vision tasks, it is common to add some type of data augmentation to the images as a part of preprocessing. You can add augmentations with any library you'd like, but in this tutorial, you will use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module.

1. Normalize the image and use [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) to chain some transforms - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html) - together:

```py
>>> from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor

>>> normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
>>> _transforms = Compose(
...     [RandomResizedCrop(feature_extractor.size), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize]
... )
```

2. The model accepts [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values) as its input. This value is generated by the feature extractor. Create a function that generates `pixel_values` from the transforms:

```py
>>> def transforms(examples):
...     examples["pixel_values"] = [_transforms(image.convert("RGB")) for image in examples["image"]]
...     return examples
```

3. Then use 馃 Datasets [`set_transform`](https://huggingface.co/docs/datasets/process.html#format-transform) to apply the transforms on-the-fly:

```py
>>> dataset.set_transform(transforms)
```

4. Now when you access the image, you will notice the transform has added the model input `pixel_values`:

```py
>>> dataset[0]["image"]
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F1A7B0630D0>,
 'label': 6,
 'pixel_values': tensor([[[ 0.0353,  0.0745,  0.1216,  ..., -0.9922, -0.9922, -0.9922],
          [-0.0196,  0.0667,  0.1294,  ..., -0.9765, -0.9843, -0.9922],
          [ 0.0196,  0.0824,  0.1137,  ..., -0.9765, -0.9686, -0.8667],
          ...,
          [ 0.0275,  0.0745,  0.0510,  ..., -0.1137, -0.1216, -0.0824],
          [ 0.0667,  0.0824,  0.0667,  ..., -0.0588, -0.0745, -0.0980],
          [ 0.0353,  0.0353,  0.0431,  ..., -0.0039, -0.0039, -0.0588]],
 
         [[ 0.2078,  0.2471,  0.2863,  ..., -0.9451, -0.9373, -0.9451],
          [ 0.1608,  0.2471,  0.3098,  ..., -0.9373, -0.9451, -0.9373],
          [ 0.2078,  0.2706,  0.3020,  ..., -0.9608, -0.9373, -0.8275],
          ...,
          [-0.0353,  0.0118, -0.0039,  ..., -0.2392, -0.2471, -0.2078],
          [ 0.0196,  0.0353,  0.0196,  ..., -0.1843, -0.2000, -0.2235],
          [-0.0118, -0.0039, -0.0039,  ..., -0.0980, -0.0980, -0.1529]],
 
         [[ 0.3961,  0.4431,  0.4980,  ..., -0.9216, -0.9137, -0.9216],
          [ 0.3569,  0.4510,  0.5216,  ..., -0.9059, -0.9137, -0.9137],
          [ 0.4118,  0.4745,  0.5216,  ..., -0.9137, -0.8902, -0.7804],
          ...,
          [-0.2314, -0.1922, -0.2078,  ..., -0.4196, -0.4275, -0.3882],
          [-0.1843, -0.1686, -0.2000,  ..., -0.3647, -0.3804, -0.4039],
          [-0.1922, -0.1922, -0.1922,  ..., -0.2941, -0.2863, -0.3412]]])}
```

Here is what the image looks like after you preprocess it. Just as you'd expect from the applied transforms, the image has been randomly cropped and its color properties are different.

```py
>>> import numpy as np
>>> import matplotlib.pyplot as plt

>>> img = dataset[0]["pixel_values"]
>>> plt.imshow(img.permute(1, 2, 0))
```

![preprocessed_image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/preprocessed_image.png)

## Multimodal

For multimodal tasks, you will use a combination of everything you've learned so far and apply your skills to an automatic speech recognition (ASR) task. This means you will need a:

* Feature extractor to preprocess the audio data.
* Tokenizer to process the text.

Let's return to the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset:

```py
>>> from datasets import load_dataset, Audio

>>> lj_speech = load_dataset("lj_speech", split="train")
```

Since you are mainly interested in the `audio` and `text` columns, remove the other columns:

```py
>>> lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"])
```

Now take a look at the `audio` and `text` columns:

```py
>>> lj_speech[0]["audio"]
{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ...,
         7.3242188e-04,  2.1362305e-04,  6.1035156e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
 'sampling_rate': 22050}

>>> lj_speech[0]["text"]
'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition'
```

Remember from the earlier section on processing audio data that you should always [resample](preprocessing#audio) your audio data's sampling rate to match the sampling rate of the dataset used to pretrain a model:

```py
>>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000))
```

### Processor

A processor combines a feature extractor and tokenizer. Load a processor with [`AutoProcessor.from_pretrained`]:

```py
>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
```

1. Create a function that processes the audio data to `input_values` and tokenizes the text to `labels`. These are your inputs to the model:

```py
>>> def prepare_dataset(example):
...     audio = example["audio"]

...     example["input_values"] = processor(audio["array"], sampling_rate=16000)

...     with processor.as_target_processor():
...         example["labels"] = processor(example["text"]).input_ids
...     return example
```

2. Apply the `prepare_dataset` function to a sample:

```py
>>> prepare_dataset(lj_speech[0])
```

Notice the processor has added `input_values` and `labels`. The sampling rate has also been correctly downsampled to 16kHz.
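
To run this over the entire dataset instead of a single example, you would again reach for 🤗 Datasets' [`map`](https://huggingface.co/docs/datasets/process.html#map) - a minimal sketch (a real preprocessing script would usually also drop the raw `audio` and `text` columns afterwards):

```py
>>> lj_speech = lj_speech.map(prepare_dataset)
```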

Awesome, you should now be able to preprocess data for any modality and even combine different modalities! In the next tutorial, learn how to fine-tune a model on your newly preprocessed data.