<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Preprocess

[[open-in-colab]]

Before you can train a model on a dataset, it needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model. In this tutorial, you'll learn that for:

* Text, use a [Tokenizer](./main_classes/tokenizer) to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
* Speech and audio, use a [Feature extractor](./main_classes/feature_extractor) to extract sequential features from audio waveforms and convert them into tensors.
* Images, use an [ImageProcessor](./main_classes/image_processor) to convert images into tensors.
* Multimodal inputs, use a [Processor](./main_classes/processors) to combine a tokenizer and a feature extractor or image processor.

<Tip>

`AutoProcessor` **always** works and automatically chooses the correct class for the model you're using, whether you're using a tokenizer, image processor, feature extractor or processor.

</Tip>
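
For example, loading from two different checkpoints returns the appropriate preprocessing class for each one; a quick sketch (the checkpoints below are only illustrative, reusing models that appear later in this tutorial):

```py
>>> from transformers import AutoProcessor

>>> # a text-only checkpoint resolves to its tokenizer
>>> processor = AutoProcessor.from_pretrained("bert-base-cased")

>>> # a speech checkpoint resolves to its feature extractor or processor
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base")
```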

Before you begin, install 🤗 Datasets so you can load some datasets to experiment with:

```bash
pip install datasets
```

## Natural Language Processing

<Youtube id="Yffk5aydLzg"/>

The main tool for preprocessing textual data is a [tokenizer](main_classes/tokenizer). A tokenizer splits text into *tokens* according to a set of rules. The tokens are converted into numbers and then tensors, which become the model inputs. Any additional inputs required by the model are added by the tokenizer.

<Tip>

If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer. This ensures the text is split the same way as the pretraining corpus, and uses the same corresponding tokens-to-index (usually referred to as the *vocab*) during pretraining.

</Tip>

Get started by loading a pretrained tokenizer with the [`AutoTokenizer.from_pretrained`] method. This downloads the *vocab* a model was pretrained with:

```py
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```

Then pass your text to the tokenizer:

```py
>>> encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
>>> print(encoded_input)
{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```

The tokenizer returns a dictionary with three important items:

* [input_ids](glossary#input-ids) are the indices corresponding to each token in the sentence.
* [attention_mask](glossary#attention-mask) indicates whether a token should be attended to or not.
* [token_type_ids](glossary#token-type-ids) identifies which sequence a token belongs to when there is more than one sequence.

Return your input by decoding the `input_ids`:

```py
>>> tokenizer.decode(encoded_input["input_ids"])
'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'
```

As you can see, the tokenizer added two special tokens - `CLS` and `SEP` (classifier and separator) - to the sentence. Not all models need
special tokens, but if they do, the tokenizer automatically adds them for you.
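
If you want to see the effect, you can ask the tokenizer to skip them with the `add_special_tokens` argument; a quick sketch using the same tokenizer as above:

```py
>>> encoded_input = tokenizer(
...     "Do not meddle in the affairs of wizards, for they are subtle and quick to anger.", add_special_tokens=False
... )
>>> tokenizer.decode(encoded_input["input_ids"])
'Do not meddle in the affairs of wizards, for they are subtle and quick to anger.'
```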

If there are several sentences you want to preprocess, pass them as a list to the tokenizer:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_inputs = tokenizer(batch_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1]]}
```

### Pad

Sentences aren't always the same length, which can be an issue because tensors, the model inputs, need to have a uniform shape. Padding is a strategy for ensuring tensors are rectangular by adding a special *padding token* to shorter sentences.

Set the `padding` parameter to `True` to pad the shorter sequences in the batch to match the longest sequence:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
```

The first and third sentences are now padded with `0`'s because they are shorter.

### Truncation

On the other end of the spectrum, sometimes a sequence may be too long for a model to handle. In this case, you'll need to truncate the sequence to a shorter length.

Set the `truncation` parameter to `True` to truncate a sequence to the maximum length accepted by the model:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
```

<Tip>

Check out the [Padding and truncation](./pad_truncation) concept guide to learn more about different padding and truncation arguments.

</Tip>
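
For example, a short sketch of combining those arguments to pad and truncate every sequence in the batch to a fixed length of 32 tokens:

```py
>>> encoded_input = tokenizer(batch_sentences, padding="max_length", truncation=True, max_length=32)
>>> [len(ids) for ids in encoded_input["input_ids"]]
[32, 32, 32]
```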

### Build tensors

Finally, you want the tokenizer to return the actual tensors that get fed to the model.

Set the `return_tensors` parameter to either `pt` for PyTorch, or `tf` for TensorFlow:

<frameworkcontent>
<pt>

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
>>> print(encoded_input)
{'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
                      [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
                      [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
```
</pt>
<tf>
```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
>>> print(encoded_input)
{'input_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
       [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
       [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int32)>,
 'token_type_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>,
 'attention_mask': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}
```
</tf>
</frameworkcontent>

<Tip>
Different pipelines support tokenizer arguments in their `__call__()` differently. `text-2-text-generation` pipelines only pass on
`truncation`. `text-generation` pipelines support `max_length`, `truncation`, `padding`, and `add_special_tokens`.
In `fill-mask` pipelines, tokenizer arguments can be passed in the `tokenizer_kwargs` argument (a dictionary).
</Tip>
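
As a rough sketch of what that looks like in practice (the checkpoints and prompts here are only illustrative):

```py
>>> from transformers import pipeline

>>> # text-generation pipelines forward max_length, truncation, padding, and add_special_tokens to the tokenizer
>>> generator = pipeline("text-generation", model="gpt2")
>>> outputs = generator("Do not meddle in the affairs of wizards", truncation=True, max_length=20)

>>> # fill-mask pipelines take tokenizer arguments through the tokenizer_kwargs dictionary
>>> fill_mask = pipeline("fill-mask", model="bert-base-cased")
>>> outputs = fill_mask("Do not meddle in the affairs of [MASK].", tokenizer_kwargs={"truncation": True})
```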

## Audio

For audio tasks, you'll need a [feature extractor](main_classes/feature_extractor) to prepare your dataset for the model. The feature extractor is designed to extract features from raw audio data, and convert them into tensors.

Load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub) for more details on how to load a dataset) to see how you can use a feature extractor with audio datasets:

```py
>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
```

Access the first element of the `audio` column to take a look at the input. Calling the `audio` column automatically loads and resamples the audio file:

```py
>>> dataset[0]["audio"]
{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
         0.        ,  0.        ], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 8000}
```

This returns three items:

* `array` is the speech signal loaded - and potentially resampled - as a 1D array.
* `path` points to the location of the audio file.
* `sampling_rate` refers to how many data points in the speech signal are measured per second.

For this tutorial, you'll use the [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) model. Take a look at the model card, and you'll learn Wav2Vec2 is pretrained on 16kHz sampled speech audio. It is important your audio data's sampling rate matches the sampling rate of the dataset used to pretrain the model. If your data's sampling rate isn't the same, then you need to resample your data.

1. Use 馃 Datasets' [`~datasets.Dataset.cast_column`] method to upsample the sampling rate to 16kHz:

```py
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
```

2. Call the `audio` column again to resample the audio file:

```py
>>> dataset[0]["audio"]
{'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,
         3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 16000}
```

Next, load a feature extractor to normalize and pad the input. When padding textual data, a `0` is added for shorter sequences. The same idea applies to audio data. The feature extractor adds a `0` - interpreted as silence - to `array`.

Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:

```py
>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
```

Pass the audio `array` to the feature extractor. We also recommend adding the `sampling_rate` argument in the feature extractor in order to better debug any silent errors that may occur.

```py
>>> audio_input = [dataset[0]["audio"]["array"]]
>>> feature_extractor(audio_input, sampling_rate=16000)
{'input_values': [array([ 3.8106556e-04,  2.7506407e-03,  2.8015103e-03, ...,
        5.6335266e-04,  4.6588284e-06, -1.7142107e-04], dtype=float32)]}
```

Just like the tokenizer, you can apply padding or truncation to handle variable sequences in a batch. Take a look at the sequence length of these two audio samples:

```py
>>> dataset[0]["audio"]["array"].shape
(173398,)

>>> dataset[1]["audio"]["array"].shape
(106496,)
```

Create a function to preprocess the dataset so the audio samples are the same length. Specify a maximum sample length, and the feature extractor will either pad or truncate the sequences to match it:

```py
>>> def preprocess_function(examples):
...     audio_arrays = [x["array"] for x in examples["audio"]]
...     inputs = feature_extractor(
...         audio_arrays,
...         sampling_rate=16000,
...         padding=True,
...         max_length=100000,
...         truncation=True,
...     )
...     return inputs
```

Apply the `preprocess_function` to the first few examples in the dataset:

```py
>>> processed_dataset = preprocess_function(dataset[:5])
```

The sample lengths are now the same and match the specified maximum length. You can pass your processed dataset to the model now!

```py
>>> processed_dataset["input_values"][0].shape
(100000,)

>>> processed_dataset["input_values"][1].shape
(100000,)
```
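
To preprocess every example rather than just a slice, you can apply the same function across the dataset with 🤗 Datasets [`~datasets.Dataset.map`] (a quick sketch; `batched=True` feeds the function several examples at a time):

```py
>>> dataset = dataset.map(preprocess_function, batched=True)
```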

## Computer vision

For computer vision tasks, you'll need an [image processor](main_classes/image_processor) to prepare your dataset for the model.
Image preprocessing consists of several steps that convert images into the input expected by the model. These steps
include but are not limited to resizing, normalizing, color channel correction, and converting images to tensors.

<Tip>

Image preprocessing often follows some form of image augmentation. Both image preprocessing and image augmentation
transform image data, but they serve different purposes:

* Image augmentation alters images in a way that can help prevent overfitting and increase the robustness of the model. You can get creative in how you augment your data - adjust brightness and colors, crop, rotate, resize, zoom, etc. However, be mindful not to change the meaning of the images with your augmentations.
* Image preprocessing guarantees that the images match the model's expected input format. When fine-tuning a computer vision model, images must be preprocessed exactly as when the model was initially trained.

You can use any library you like for image augmentation. For image preprocessing, use the `ImageProcessor` associated with the model.

</Tip>

Load the [food101](https://huggingface.co/datasets/food101) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub) for more details on how to load a dataset) to see how you can use an image processor with computer vision datasets:

<Tip>

Use 馃 Datasets `split` parameter to only load a small sample from the training split since the dataset is quite large!

</Tip>

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("food101", split="train[:100]")
```

Next, take a look at the image with 🤗 Datasets [`Image`](https://huggingface.co/docs/datasets/package_reference/main_classes?highlight=image#datasets.Image) feature:

```py
>>> dataset[0]["image"]
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vision-preprocess-tutorial.png"/>
</div>

Load the image processor with [`AutoImageProcessor.from_pretrained`]:

```py
>>> from transformers import AutoImageProcessor

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
```

First, let's add some image augmentation. You can use any library you prefer, but in this tutorial, we'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module. If you're interested in using another data augmentation library, learn how in the [Albumentations](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb) or [Kornia notebooks](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb).

1. Here we use [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) to chain together a couple of
transforms - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html).
Note that for resizing, we can get the image size requirements from the `image_processor`. For some models, an exact height and
width are expected, for others only the `shortest_edge` is defined.

```py
>>> from torchvision.transforms import RandomResizedCrop, ColorJitter, Compose

>>> size = (
...     image_processor.size["shortest_edge"]
...     if "shortest_edge" in image_processor.size
...     else (image_processor.size["height"], image_processor.size["width"])
... )

>>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5)])
```

2. The model accepts [`pixel_values`](model_doc/vision-encoder-decoder#transformers.VisionEncoderDecoderModel.forward.pixel_values)
as its input. `ImageProcessor` can take care of normalizing the images, and generating appropriate tensors.
Create a function that combines image augmentation and image preprocessing for a batch of images and generates `pixel_values`:

```py
>>> def transforms(examples):
...     images = [_transforms(img.convert("RGB")) for img in examples["image"]]
...     examples["pixel_values"] = image_processor(images, do_resize=False, return_tensors="pt")["pixel_values"]
...     return examples
```

<Tip>

In the example above we set `do_resize=False` because we have already resized the images in the image augmentation transformation,
and leveraged the `size` attribute from the appropriate `image_processor`. If you do not resize images during image augmentation,
leave this parameter out. By default, `ImageProcessor` will handle the resizing.

If you wish to normalize images as a part of the augmentation transformation, use the `image_processor.image_mean`
and `image_processor.image_std` values.
</Tip>
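
A minimal sketch of that second option, assuming you also convert the images to tensors in the same transformation and then tell the image processor not to normalize them again:

```py
>>> from torchvision.transforms import ColorJitter, Compose, Normalize, RandomResizedCrop, ToTensor

>>> # normalize inside the augmentation pipeline using the image processor's statistics
>>> _transforms = Compose(
...     [
...         RandomResizedCrop(size),
...         ColorJitter(brightness=0.5, hue=0.5),
...         ToTensor(),
...         Normalize(mean=image_processor.image_mean, std=image_processor.image_std),
...     ]
... )

>>> # in your transforms function, also pass do_normalize=False so the values aren't normalized twice:
>>> # image_processor(images, do_resize=False, do_normalize=False, return_tensors="pt")
```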

3. Then use 馃 Datasets[`~datasets.Dataset.set_transform`] to apply the transforms on the fly:
Steven Liu's avatar
Steven Liu committed
422
423
424
425
```py
>>> dataset.set_transform(transforms)
```

4. Now when you access the image, you'll notice the image processor has added `pixel_values`. You can pass your processed dataset to the model now!

```py
>>> dataset[0].keys()
```

Here is what the image looks like after the transforms are applied. The image has been randomly cropped and its color properties are different.

```py
>>> import numpy as np
>>> import matplotlib.pyplot as plt

>>> img = dataset[0]["pixel_values"]
>>> plt.imshow(img.permute(1, 2, 0))
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/preprocessed_image.png"/>
</div>

<Tip>

For tasks like object detection, semantic segmentation, instance segmentation, and panoptic segmentation, `ImageProcessor`
offers post-processing methods. These methods convert the model's raw outputs into meaningful predictions such as bounding boxes
or segmentation maps.

</Tip>
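
As an illustration of those post-processing methods, here is a hedged sketch of object detection post-processing with [`~DetrImageProcessor.post_process_object_detection`]; `model`, `inputs`, and `image` are placeholders for a DETR model, its prepared inputs, and the original PIL image:

```py
>>> import torch

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # target_sizes holds each original image's (height, width) so boxes are rescaled to the right resolution
>>> target_sizes = torch.tensor([image.size[::-1]])
>>> results = image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)
```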

### Pad

In some cases, for instance, when fine-tuning [DETR](./model_doc/detr), the model applies scale augmentation at training
time. This may cause images to be different sizes in a batch. You can use [`DetrImageProcessor.pad`]
from [`DetrImageProcessor`] and define a custom `collate_fn` to batch images together.

```py
>>> def collate_fn(batch):
...     pixel_values = [item["pixel_values"] for item in batch]
...     encoding = image_processor.pad(pixel_values, return_tensors="pt")
...     labels = [item["labels"] for item in batch]
...     batch = {}
...     batch["pixel_values"] = encoding["pixel_values"]
...     batch["pixel_mask"] = encoding["pixel_mask"]
...     batch["labels"] = labels
...     return batch
```
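
You could then hand the `collate_fn` to a PyTorch `DataLoader` (a sketch, assuming `dataset` yields dicts with `pixel_values` and `labels`):

```py
>>> from torch.utils.data import DataLoader

>>> dataloader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)
>>> batch = next(iter(dataloader))
>>> batch.keys()
dict_keys(['pixel_values', 'pixel_mask', 'labels'])
```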

## Multimodal

For tasks involving multimodal inputs, you'll need a [processor](main_classes/processors) to prepare your dataset for the model. A processor couples together two processing objects such as a tokenizer and a feature extractor.

Load the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub) for more details on how to load a dataset) to see how you can use a processor for automatic speech recognition (ASR):

```py
>>> from datasets import load_dataset

>>> lj_speech = load_dataset("lj_speech", split="train")
```

For ASR, you're mainly focused on `audio` and `text` so you can remove the other columns:

```py
>>> lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"])
```

Now take a look at the `audio` and `text` columns:

```py
>>> lj_speech[0]["audio"]
{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ...,
         7.3242188e-04,  2.1362305e-04,  6.1035156e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
 'sampling_rate': 22050}

>>> lj_speech[0]["text"]
'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition'
```

Remember you should always [resample](preprocessing#audio) your audio dataset's sampling rate to match the sampling rate of the dataset used to pretrain a model!

```py
>>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000))
```

Load a processor with [`AutoProcessor.from_pretrained`]:

```py
>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
```

1. Create a function to process the audio data contained in `array` to `input_values`, and tokenize `text` to `labels`. These are the inputs to the model:

```py
>>> def prepare_dataset(example):
...     audio = example["audio"]

...     example.update(processor(audio=audio["array"], text=example["text"], sampling_rate=16000))

...     return example
```

2. Apply the `prepare_dataset` function to a sample:

```py
>>> prepare_dataset(lj_speech[0])
```

The processor has now added `input_values` and `labels`, and the sampling rate has also been correctly downsampled to 16kHz. You can pass your processed dataset to the model now!
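
To preprocess the full dataset instead of a single sample, one option is 🤗 Datasets [`~datasets.Dataset.map`] (a sketch; removing the original columns keeps only the processor outputs):

```py
>>> lj_speech = lj_speech.map(prepare_dataset, remove_columns=["audio", "text"])
```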