<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Preprocess

[[open-in-colab]]

Before you can train a model on a dataset, it needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model. In this tutorial, you'll learn that for:

* Text, use a [Tokenizer](./main_classes/tokenizer) to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
* Computer vision and speech, use a [Feature extractor](./main_classes/feature_extractor) to extract sequential features from audio waveforms and images and convert them into tensors.
* Multimodal inputs, use a [Processor](./main_classes/processors) to combine a tokenizer and a feature extractor.

<Tip>

`AutoProcessor` **always** works and automatically chooses the correct class for the model you're using, whether you're using a tokenizer, feature extractor or processor.

</Tip>
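
For example, the same `from_pretrained` call resolves to the right preprocessing class for each checkpoint (a small sketch; the two checkpoints below are the ones used later in this tutorial):

```py
>>> from transformers import AutoProcessor

>>> # For a text-only model like BERT, the correct preprocessing class is its tokenizer.
>>> bert_preprocessor = AutoProcessor.from_pretrained("bert-base-cased")

>>> # For a speech model like Wav2Vec2, it returns the processor wrapping the feature extractor (and tokenizer).
>>> wav2vec2_preprocessor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
```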

Before you begin, install 🤗 Datasets so you can load some datasets to experiment with:

```bash
pip install datasets
```

## Natural Language Processing

<Youtube id="Yffk5aydLzg"/>

The main tool for preprocessing textual data is a [tokenizer](main_classes/tokenizer). A tokenizer splits text into *tokens* according to a set of rules. The tokens are converted into numbers and then tensors, which become the model inputs. Any additional inputs required by the model are added by the tokenizer.

<Tip>

If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer. This ensures the text is split the same way as the pretraining corpus, and uses the same token-to-index mapping (usually referred to as the *vocab*) as during pretraining.

</Tip>

Get started by loading a pretrained tokenizer with the [`AutoTokenizer.from_pretrained`] method. This downloads the *vocab* a model was pretrained with:

```py
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```
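
If you're curious how the text will be split before it's converted to IDs, you can call `tokenizer.tokenize` directly (an optional peek; the exact subword pieces depend on the checkpoint's vocab, so no output is shown here):

```py
>>> # Returns the token strings produced by the splitting rules (without special tokens).
>>> tokenizer.tokenize("Do not meddle in the affairs of wizards.")
```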

Then pass your text to the tokenizer:

```py
>>> encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
>>> print(encoded_input)
{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102], 
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```

The tokenizer returns a dictionary with three important items:

* [input_ids](glossary#input-ids) are the indices corresponding to each token in the sentence.
* [attention_mask](glossary#attention-mask) indicates whether a token should be attended to or not.
* [token_type_ids](glossary#token-type-ids) identifies which sequence a token belongs to when there is more than one sequence (see the sentence-pair example below).
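
To see `token_type_ids` in action, pass a pair of sentences to the tokenizer (a quick illustration: tokens from the first sentence are marked `0` and tokens from the second are marked `1`):

```py
>>> # The first sentence's tokens get token_type_id 0, the second sentence's tokens get 1.
>>> pair_encoding = tokenizer("Do wizards anger quickly?", "They are subtle and quick to anger.")
>>> pair_encoding["token_type_ids"]
```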

Decode the `input_ids` to return the original input:

```py
>>> tokenizer.decode(encoded_input["input_ids"])
'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'
```

As you can see, the tokenizer added two special tokens - `CLS` and `SEP` (classifier and separator) - to the sentence. Not all models need special tokens, but if they do, the tokenizer automatically adds them for you.
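
If you'd rather get the text back without the special tokens, `decode` also accepts a `skip_special_tokens` argument:

```py
>>> tokenizer.decode(encoded_input["input_ids"], skip_special_tokens=True)
'Do not meddle in the affairs of wizards, for they are subtle and quick to anger.'
```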

If there are several sentences you want to preprocess, pass them as a list to the tokenizer:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_inputs = tokenizer(batch_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1]]}
```

### Pad

Sentences aren't always the same length, which can be an issue because tensors, the model inputs, need to have a uniform shape. Padding is a strategy for ensuring tensors are rectangular by adding a special *padding token* to shorter sentences.

Set the `padding` parameter to `True` to pad the shorter sequences in the batch to match the longest sequence:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
```

The first and third sentences are now padded with `0`'s because they are shorter.

### Truncation

On the other end of the spectrum, sometimes a sequence may be too long for a model to handle. In this case, you'll need to truncate the sequence to a shorter length.

Set the `truncation` parameter to `True` to truncate a sequence to the maximum length accepted by the model:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
```

<Tip>

Check out the [Padding and truncation](./pad_truncation) concept guide to learn more about the different padding and truncation arguments.

</Tip>
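
For example, to force every sequence in a batch to a fixed length, combine `padding="max_length"` and `truncation=True` with a `max_length` value (a short sketch; `10` is an arbitrary length chosen for illustration):

```py
>>> fixed_input = tokenizer(batch_sentences, padding="max_length", truncation=True, max_length=10)
>>> # Every sequence is now padded or truncated to exactly 10 tokens.
>>> [len(ids) for ids in fixed_input["input_ids"]]
[10, 10, 10]
```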

### Build tensors

Finally, you want the tokenizer to return the actual tensors that get fed to the model.

Set the `return_tensors` parameter to either `pt` for PyTorch, or `tf` for TensorFlow:

<frameworkcontent>
<pt>

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
>>> print(encoded_input)
{'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
                      [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
                      [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]), 
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
```
</pt>
<tf>
```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
>>> print(encoded_input)
{'input_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
       [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
       [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int32)>, 
 'token_type_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>, 
 'attention_mask': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}
```
</tf>
</frameworkcontent>

## Audio

For audio tasks, you'll need a [feature extractor](main_classes/feature_extractor) to prepare your dataset for the model. The feature extractor is designed to extract features from raw audio data, and convert them into tensors.

Load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a feature extractor with audio datasets:

```py
>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
```

Access the first element of the `audio` column to take a look at the input. Calling the `audio` column automatically loads and resamples the audio file:

```py
>>> dataset[0]["audio"]
{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
         0.        ,  0.        ], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 8000}
```

This returns three items:

* `array` is the speech signal loaded - and potentially resampled - as a 1D array.
* `path` points to the location of the audio file.
* `sampling_rate` refers to how many data points in the speech signal are measured per second.

For this tutorial, you'll use the [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) model. Take a look at the model card, and you'll learn Wav2Vec2 is pretrained on 16kHz sampled speech audio. It is important your audio data's sampling rate matches the sampling rate of the dataset used to pretrain the model. If your data's sampling rate isn't the same, then you need to resample your data. 
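
You can confirm the mismatch by checking the sampling rate of a sample directly (it matches the `sampling_rate` value shown in the output above):

```py
>>> dataset[0]["audio"]["sampling_rate"]
8000
```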

1. Use 馃 Datasets' [`~datasets.Dataset.cast_column`] method to upsample the sampling rate to 16kHz:

```py
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
```

2. Call the `audio` column again to resample the audio file:

```py
>>> dataset[0]["audio"]
{'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,
         3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 16000}
```

Next, load a feature extractor to normalize and pad the input. When padding textual data, a `0` is added for shorter sequences. The same idea applies to audio data. The feature extractor adds a `0` - interpreted as silence - to `array`.

Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:

```py
>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
```
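
As a quick sanity check, the loaded feature extractor also stores the sampling rate it expects (an optional check; `sampling_rate` is an attribute read from the feature extractor's configuration):

```py
>>> feature_extractor.sampling_rate
16000
```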

Pass the audio `array` to the feature extractor. We also recommend passing the `sampling_rate` argument to the feature extractor in order to better debug any silent errors that may occur.

```py
>>> audio_input = [dataset[0]["audio"]["array"]]
>>> feature_extractor(audio_input, sampling_rate=16000)
{'input_values': [array([ 3.8106556e-04,  2.7506407e-03,  2.8015103e-03, ...,
        5.6335266e-04,  4.6588284e-06, -1.7142107e-04], dtype=float32)]}
```

Just like the tokenizer, you can apply padding or truncation to handle variable sequences in a batch. Take a look at the sequence length of these two audio samples:

```py
>>> dataset[0]["audio"]["array"].shape
(173398,)

>>> dataset[1]["audio"]["array"].shape
(106496,)
```

Create a function to preprocess the dataset so the audio samples are the same length. Specify a maximum sample length, and the feature extractor will either pad or truncate the sequences to match it:

```py
>>> def preprocess_function(examples):
...     audio_arrays = [x["array"] for x in examples["audio"]]
...     inputs = feature_extractor(
...         audio_arrays,
...         sampling_rate=16000,
...         padding=True,
...         max_length=100000,
...         truncation=True,
...     )
...     return inputs
```

Apply the `preprocess_function` to the first few examples in the dataset:

```py
>>> processed_dataset = preprocess_function(dataset[:5])
```

The sample lengths are now the same and match the specified maximum length. You can pass your processed dataset to the model now!

```py
>>> processed_dataset["input_values"][0].shape
(100000,)

>>> processed_dataset["input_values"][1].shape
(100000,)
```

## Computer vision

For computer vision tasks, you'll need a [feature extractor](main_classes/feature_extractor) to prepare your dataset for the model. The feature extractor is designed to extract features from images, and convert them into tensors.

Load the [food101](https://huggingface.co/datasets/food101) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a feature extractor with computer vision datasets:

<Tip>

Use 馃 Datasets `split` parameter to only load a small sample from the training split since the dataset is quite large!

</Tip>

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("food101", split="train[:100]")
```

Next, take a look at the image with the 🤗 Datasets [`Image`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=image#datasets.Image) feature:

```py
>>> dataset[0]["image"]
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vision-preprocess-tutorial.png"/>
</div>

Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:

```py
>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
```

For computer vision tasks, it is common to add some type of data augmentation to the images as a part of preprocessing. You can add augmentations with any library you'd like, but in this tutorial, you'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module. If you're interested in using another data augmentation library, learn how in the [Albumentations](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb) or [Kornia notebooks](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb).

1. Normalize the image with the feature extractor and use [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) to chain some transforms - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html) - together:

```py
>>> from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor

>>> normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
>>> size = (
...     feature_extractor.size["shortest_edge"]
...     if "shortest_edge" in feature_extractor.size
...     else (feature_extractor.size["height"], feature_extractor.size["width"])
... )
>>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize])
```

2. The model accepts [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values) as its input, which is generated by the feature extractor. Create a function that generates `pixel_values` from the transforms:

```py
>>> def transforms(examples):
...     examples["pixel_values"] = [_transforms(image.convert("RGB")) for image in examples["image"]]
...     return examples
```

3. Then use 🤗 Datasets [`set_transform`](https://huggingface.co/docs/datasets/process.html#format-transform) to apply the transforms on the fly:

```py
>>> dataset.set_transform(transforms)
```

4. Now when you access the image, you'll notice the feature extractor has added `pixel_values`. You can pass your processed dataset to the model now!

```py
>>> dataset[0]["image"]
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F1A7B0630D0>,
 'label': 6,
 'pixel_values': tensor([[[ 0.0353,  0.0745,  0.1216,  ..., -0.9922, -0.9922, -0.9922],
          [-0.0196,  0.0667,  0.1294,  ..., -0.9765, -0.9843, -0.9922],
          [ 0.0196,  0.0824,  0.1137,  ..., -0.9765, -0.9686, -0.8667],
          ...,
          [ 0.0275,  0.0745,  0.0510,  ..., -0.1137, -0.1216, -0.0824],
          [ 0.0667,  0.0824,  0.0667,  ..., -0.0588, -0.0745, -0.0980],
          [ 0.0353,  0.0353,  0.0431,  ..., -0.0039, -0.0039, -0.0588]],
 
         [[ 0.2078,  0.2471,  0.2863,  ..., -0.9451, -0.9373, -0.9451],
          [ 0.1608,  0.2471,  0.3098,  ..., -0.9373, -0.9451, -0.9373],
          [ 0.2078,  0.2706,  0.3020,  ..., -0.9608, -0.9373, -0.8275],
          ...,
          [-0.0353,  0.0118, -0.0039,  ..., -0.2392, -0.2471, -0.2078],
          [ 0.0196,  0.0353,  0.0196,  ..., -0.1843, -0.2000, -0.2235],
          [-0.0118, -0.0039, -0.0039,  ..., -0.0980, -0.0980, -0.1529]],
 
         [[ 0.3961,  0.4431,  0.4980,  ..., -0.9216, -0.9137, -0.9216],
          [ 0.3569,  0.4510,  0.5216,  ..., -0.9059, -0.9137, -0.9137],
          [ 0.4118,  0.4745,  0.5216,  ..., -0.9137, -0.8902, -0.7804],
          ...,
          [-0.2314, -0.1922, -0.2078,  ..., -0.4196, -0.4275, -0.3882],
          [-0.1843, -0.1686, -0.2000,  ..., -0.3647, -0.3804, -0.4039],
          [-0.1922, -0.1922, -0.1922,  ..., -0.2941, -0.2863, -0.3412]]])}
```

Here is what the image looks like after the transforms are applied. The image has been randomly cropped and its color properties are different.

```py
>>> import numpy as np
>>> import matplotlib.pyplot as plt

>>> img = dataset[0]["pixel_values"]
>>> plt.imshow(img.permute(1, 2, 0))
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/preprocessed_image.png"/>
</div>

## Multimodal

For tasks involving multimodal inputs, you'll need a [processor](main_classes/processors) to prepare your dataset for the model. A processor couples a tokenizer and feature extractor.

Load the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a processor for automatic speech recognition (ASR):

```py
>>> from datasets import load_dataset, Audio

>>> lj_speech = load_dataset("lj_speech", split="train")
```

For ASR, you're mainly focused on `audio` and `text` so you can remove the other columns:

```py
>>> lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"])
```

Now take a look at the `audio` and `text` columns:

```py
>>> lj_speech[0]["audio"]
{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ...,
         7.3242188e-04,  2.1362305e-04,  6.1035156e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
 'sampling_rate': 22050}

>>> lj_speech[0]["text"]
'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition'
```

Remember you should always [resample](preprocessing#audio) your audio dataset's sampling rate to match the sampling rate of the dataset used to pretrain a model!

```py
>>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000))
```

Load a processor with [`AutoProcessor.from_pretrained`]:

```py
>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
```

1. Create a function to process the audio data contained in `array` to `input_values`, and tokenize `text` to `labels`. These are the inputs to the model:

```py
>>> def prepare_dataset(example):
...     audio = example["audio"]

...     example.update(processor(audio=audio["array"], text=example["text"], sampling_rate=16000))

...     return example
```

2. Apply the `prepare_dataset` function to a sample:

```py
>>> prepare_dataset(lj_speech[0])
```

The processor has now added `input_values` and `labels`, and the sampling rate has also been correctly downsampled to 16kHz. You can pass your processed dataset to the model now!
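
As a quick check, you can confirm that both new keys are present in the processed example (a small sketch reusing `prepare_dataset` from the steps above):

```py
>>> processed_example = prepare_dataset(lj_speech[0])
>>> "input_values" in processed_example and "labels" in processed_example
True
```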