<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Preprocess

[[open-in-colab]]

Before you can train a model on a dataset, it needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model. In this tutorial, you'll learn that for:

* Text, use a [Tokenizer](./main_classes/tokenizer) to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
* Images, use an [ImageProcessor](./main_classes/image) to convert images into tensors.
* Speech and audio, use a [Feature extractor](./main_classes/feature_extractor) to extract sequential features from audio waveforms and convert them into tensors.
* Multimodal inputs, use a [Processor](./main_classes/processors) to combine a tokenizer and a feature extractor or image processor.

<Tip>

`AutoProcessor` **always** works and automatically chooses the correct class for the model you're using, whether you're using a tokenizer, image processor, feature extractor or processor.

</Tip>
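
For example, here's a minimal sketch of loading a processor class with `AutoProcessor` (the checkpoint below is the speech recognition checkpoint used later in this tutorial; for a text-only checkpoint it would return a tokenizer instead):

```py
>>> from transformers import AutoProcessor

>>> # for a speech checkpoint with both a feature extractor and a tokenizer,
>>> # AutoProcessor returns the combined processor class
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
```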

Before you begin, install 🤗 Datasets so you can load some datasets to experiment with:

```bash
pip install datasets
```

## Natural Language Processing

<Youtube id="Yffk5aydLzg"/>

The main tool for preprocessing textual data is a [tokenizer](main_classes/tokenizer). A tokenizer splits text into *tokens* according to a set of rules. The tokens are converted into numbers and then tensors, which become the model inputs. Any additional inputs required by the model are added by the tokenizer.

<Tip>

If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer. This ensures the text is split the same way as the pretraining corpus, and uses the same corresponding tokens-to-index mapping (usually referred to as the *vocab*) as during pretraining.

</Tip>

Get started by loading a pretrained tokenizer with the [`AutoTokenizer.from_pretrained`] method. This downloads the *vocab* a model was pretrained with:

```py
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```

Then pass your text to the tokenizer:

```py
>>> encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
>>> print(encoded_input)
{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102], 
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```

The tokenizer returns a dictionary with three important items:

* [input_ids](glossary#input-ids) are the indices corresponding to each token in the sentence.
* [attention_mask](glossary#attention-mask) indicates whether a token should be attended to or not.
* [token_type_ids](glossary#token-type-ids) identifies which sequence a token belongs to when there is more than one sequence.
Return your input by decoding the `input_ids`:

```py
>>> tokenizer.decode(encoded_input["input_ids"])
'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'
```

As you can see, the tokenizer added two special tokens - `CLS` and `SEP` (classifier and separator) - to the sentence. Not all models need
special tokens, but if they do, the tokenizer automatically adds them for you.
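
If you'd rather see the encoding without them, you can disable them with `add_special_tokens=False` - a minimal sketch using the same tokenizer as above:

```py
>>> encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.", add_special_tokens=False)
>>> tokenizer.decode(encoded_input["input_ids"])
'Do not meddle in the affairs of wizards, for they are subtle and quick to anger.'
```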

If there are several sentences you want to preprocess, pass them as a list to the tokenizer:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_inputs = tokenizer(batch_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1]]}
```

### Pad

Sentences aren't always the same length, which can be an issue because tensors, the model inputs, need to have a uniform shape. Padding is a strategy for ensuring tensors are rectangular by adding a special *padding token* to shorter sentences.

Set the `padding` parameter to `True` to pad the shorter sequences in the batch to match the longest sequence:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
```

The first and third sentences are now padded with `0`'s because they are shorter.

### Truncation

On the other end of the spectrum, sometimes a sequence may be too long for a model to handle. In this case, you'll need to truncate the sequence to a shorter length.

Set the `truncation` parameter to `True` to truncate a sequence to the maximum length accepted by the model:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
```

<Tip>

Check out the [Padding and truncation](./pad_truncation) concept guide to learn more about the different padding and truncation arguments.

</Tip>
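
For example, here's a minimal sketch of the `max_length` strategy, which pads and truncates every sequence to a fixed length you choose (the value `20` here is arbitrary):

```py
>>> encoded_input = tokenizer(batch_sentences, padding="max_length", truncation=True, max_length=20)
>>> [len(ids) for ids in encoded_input["input_ids"]]
[20, 20, 20]
```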

### Build tensors

Finally, you want the tokenizer to return the actual tensors that get fed to the model.

Set the `return_tensors` parameter to either `pt` for PyTorch, or `tf` for TensorFlow:

<frameworkcontent>
<pt>

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
>>> print(encoded_input)
{'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
                      [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
                      [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]), 
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
```
</pt>
<tf>
```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
>>> print(encoded_input)
{'input_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
       [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
       [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int32)>, 
 'token_type_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>, 
 'attention_mask': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}
```
</tf>
</frameworkcontent>

## Audio

For audio tasks, you'll need a [feature extractor](main_classes/feature_extractor) to prepare your dataset for the model. The feature extractor is designed to extract features from raw audio data, and convert them into tensors.

Load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a feature extractor with audio datasets:

```py
>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
```

Access the first element of the `audio` column to take a look at the input. Calling the `audio` column automatically loads and resamples the audio file:

```py
>>> dataset[0]["audio"]
{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
         0.        ,  0.        ], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 8000}
```

This returns three items:

* `array` is the speech signal loaded - and potentially resampled - as a 1D array.
* `path` points to the location of the audio file.
* `sampling_rate` refers to how many data points in the speech signal are measured per second.

For this tutorial, you'll use the [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) model. Take a look at the model card, and you'll learn Wav2Vec2 is pretrained on 16kHz sampled speech audio. It is important that your audio data's sampling rate matches the sampling rate of the dataset used to pretrain the model. If your data's sampling rate isn't the same, then you need to resample your data.

1. Use 馃 Datasets' [`~datasets.Dataset.cast_column`] method to upsample the sampling rate to 16kHz:

```py
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
```

2. Call the `audio` column again to resample the audio file:

```py
>>> dataset[0]["audio"]
{'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,
         3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 16000}
```

Next, load a feature extractor to normalize and pad the input. When padding textual data, a `0` is added for shorter sequences. The same idea applies to audio data. The feature extractor adds a `0` - interpreted as silence - to `array`.

Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:

```py
>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
```

Pass the audio `array` to the feature extractor. We also recommend adding the `sampling_rate` argument in the feature extractor in order to better debug any silent errors that may occur.

```py
>>> audio_input = [dataset[0]["audio"]["array"]]
>>> feature_extractor(audio_input, sampling_rate=16000)
{'input_values': [array([ 3.8106556e-04,  2.7506407e-03,  2.8015103e-03, ...,
        5.6335266e-04,  4.6588284e-06, -1.7142107e-04], dtype=float32)]}
```

Just like the tokenizer, you can apply padding or truncation to handle variable-length sequences in a batch. Take a look at the sequence length of these two audio samples:

```py
>>> dataset[0]["audio"]["array"].shape
(173398,)

>>> dataset[1]["audio"]["array"].shape
(106496,)
```

Create a function to preprocess the dataset so the audio samples are the same length. Specify a maximum sample length, and the feature extractor will either pad or truncate the sequences to match it:

```py
>>> def preprocess_function(examples):
...     audio_arrays = [x["array"] for x in examples["audio"]]
...     inputs = feature_extractor(
...         audio_arrays,
...         sampling_rate=16000,
...         padding=True,
...         max_length=100000,
...         truncation=True,
...     )
...     return inputs
```

Apply the `preprocess_function` to the first few examples in the dataset:

```py
>>> processed_dataset = preprocess_function(dataset[:5])
```

The sample lengths are now the same and match the specified maximum length. You can pass your processed dataset to the model now!

```py
>>> processed_dataset["input_values"][0].shape
(100000,)

>>> processed_dataset["input_values"][1].shape
(100000,)
```
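
To preprocess the entire dataset instead of a slice, you can apply the same function with 🤗 Datasets [`~datasets.Dataset.map`] - a minimal sketch, assuming you want a new `input_values` column added to every example:

```py
>>> # batched=True passes a batch of examples to preprocess_function at once
>>> dataset = dataset.map(preprocess_function, batched=True)
```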

## Computer vision

For computer vision tasks, you'll need an [image processor](main_classes/image_processor) to prepare your dataset for the model. The image processor is designed to preprocess images, and convert them into tensors.

Load the [food101](https://huggingface.co/datasets/food101) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use an image processor with computer vision datasets:

<Tip>

Use 馃 Datasets `split` parameter to only load a small sample from the training split since the dataset is quite large!

</Tip>

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("food101", split="train[:100]")
```

Next, take a look at the image with 🤗 Datasets [`Image`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=image#datasets.Image) feature:

```py
>>> dataset[0]["image"]
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vision-preprocess-tutorial.png"/>
</div>

Load the image processor with [`AutoImageProcessor.from_pretrained`]:

```py
>>> from transformers import AutoImageProcessor

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
```
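
The image processor can already turn an image into model-ready tensors on its own - a minimal sketch without any augmentation (the exact output shape depends on the checkpoint's size configuration):

```py
>>> inputs = image_processor(dataset[0]["image"], return_tensors="pt")
>>> inputs["pixel_values"].shape
torch.Size([1, 3, 224, 224])
```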

For computer vision tasks, it is common to add some type of data augmentation to the images as a part of preprocessing. You can add augmentations with any library you'd like, but in this tutorial, you'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module. If you're interested in using another data augmentation library, learn how in the [Albumentations](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb) or [Kornia notebooks](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb).

1. Normalize the image with the image processor and use [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) to chain some transforms - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html) - together:

```py
>>> from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor

>>> normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
>>> size = (
...     image_processor.size["shortest_edge"]
...     if "shortest_edge" in image_processor.size
...     else (image_processor.size["height"], image_processor.size["width"])
... )
>>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize])
```

2. The model accepts [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values) as its input, which is generated by the image processor. Create a function that generates `pixel_values` from the transforms:

```py
>>> def transforms(examples):
...     examples["pixel_values"] = [_transforms(image.convert("RGB")) for image in examples["image"]]
...     return examples
```

3. Then use 馃 Datasets [`set_transform`](https://huggingface.co/docs/datasets/process.html#format-transform) to apply the transforms on the fly:

```py
>>> dataset.set_transform(transforms)
```

4. Now when you access the image, you'll notice the transform has added `pixel_values`. You can pass your processed dataset to the model now!

```py
>>> dataset[0].keys()
```

Here is what the image looks like after the transforms are applied. The image has been randomly cropped and its color properties are different.

```py
>>> import numpy as np
>>> import matplotlib.pyplot as plt

>>> img = dataset[0]["pixel_values"]
>>> plt.imshow(img.permute(1, 2, 0))
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/preprocessed_image.png"/>
</div>

## Multimodal

For tasks involving multimodal inputs, you'll need a [processor](main_classes/processors) to prepare your dataset for the model. A processor couples together two processing objects, such as a tokenizer and a feature extractor.

Load the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a processor for automatic speech recognition (ASR):

```py
>>> from datasets import load_dataset, Audio

>>> lj_speech = load_dataset("lj_speech", split="train")
```

For ASR, you're mainly focused on `audio` and `text` so you can remove the other columns:

```py
>>> lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"])
```

Now take a look at the `audio` and `text` columns:

```py
>>> lj_speech[0]["audio"]
{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ...,
         7.3242188e-04,  2.1362305e-04,  6.1035156e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
 'sampling_rate': 22050}

>>> lj_speech[0]["text"]
'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition'
```

Remember you should always [resample](preprocessing#audio) your audio dataset's sampling rate to match the sampling rate of the dataset used to pretrain a model!

```py
>>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000))
```

Load a processor with [`AutoProcessor.from_pretrained`]:

```py
>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
```

1. Create a function to process the audio data contained in `array` to `input_values`, and tokenize `text` to `labels`. These are the inputs to the model:

```py
>>> def prepare_dataset(example):
...     audio = example["audio"]

...     example.update(processor(audio=audio["array"], text=example["text"], sampling_rate=16000))

...     return example
```

2. Apply the `prepare_dataset` function to a sample:

```py
>>> prepare_dataset(lj_speech[0])
```

The processor has now added `input_values` and `labels`, and the sampling rate has also been correctly downsampled to 16kHz. You can pass your processed dataset to the model now!
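
To process the whole dataset, you can apply the same function with 🤗 Datasets [`~datasets.Dataset.map`] - a minimal sketch, dropping the original columns so only the model inputs remain:

```py
>>> # each example gains input_values and labels from the processor
>>> lj_speech = lj_speech.map(prepare_dataset, remove_columns=["audio", "text"])
```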