<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Preprocess

[[open-in-colab]]

Before you can train a model on a dataset, it needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model. In this tutorial, you'll learn that for:

* Text, use a [Tokenizer](./main_classes/tokenizer) to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
* Speech and audio, use a [Feature extractor](./main_classes/feature_extractor) to extract sequential features from audio waveforms and convert them into tensors.
* Image inputs, use an [ImageProcessor](./main_classes/image) to convert images into tensors.
* Multimodal inputs, use a [Processor](./main_classes/processors) to combine a tokenizer and a feature extractor or image processor.

<Tip>

`AutoProcessor` **always** works and automatically chooses the correct class for the model you're using, whether you're using a tokenizer, image processor, feature extractor or processor.
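
For example, here's a minimal sketch (the checkpoints are simply ones used later in this tutorial) of loading a preprocessor without knowing its exact class in advance:

```py
>>> from transformers import AutoProcessor

>>> # For a text-only checkpoint this resolves to the model's tokenizer,
>>> # and for a speech-to-text checkpoint it resolves to the full processor
>>> text_preprocessor = AutoProcessor.from_pretrained("bert-base-cased")
>>> asr_preprocessor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
```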

</Tip>

Before you begin, install 🤗 Datasets so you can load some datasets to experiment with:

```bash
pip install datasets
```

## Natural Language Processing

<Youtube id="Yffk5aydLzg"/>

The main tool for preprocessing textual data is a [tokenizer](main_classes/tokenizer). A tokenizer splits text into *tokens* according to a set of rules. The tokens are converted into numbers and then tensors, which become the model inputs. Any additional inputs required by the model are added by the tokenizer.

<Tip>

If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer. This ensures the text is split the same way as the pretraining corpus, and uses the same corresponding tokens-to-index mapping (usually referred to as the *vocab*) as during pretraining.
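
As a quick sketch (the two checkpoints here are only for illustration), the same sentence is split differently by different vocabs, so pairing a checkpoint with another model's tokenizer would silently produce the wrong inputs:

```py
>>> from transformers import AutoTokenizer

>>> cased = AutoTokenizer.from_pretrained("bert-base-cased")
>>> uncased = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> # The cased tokenizer keeps capitalization while the uncased one lowercases first,
>>> # so the resulting tokens (and therefore the input_ids) differ
>>> cased.tokenize("Do not meddle in the affairs of wizards") == uncased.tokenize("Do not meddle in the affairs of wizards")
False
```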

</Tip>

Get started by loading a pretrained tokenizer with the [`AutoTokenizer.from_pretrained`] method. This downloads the *vocab* a model was pretrained with:

```py
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```

Then pass your text to the tokenizer:

```py
>>> encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
>>> print(encoded_input)
{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```

The tokenizer returns a dictionary with three important items:

* [input_ids](glossary#input-ids) are the indices corresponding to each token in the sentence.
* [attention_mask](glossary#attention-mask) indicates whether a token should be attended to or not.
* [token_type_ids](glossary#token-type-ids) identifies which sequence a token belongs to when there is more than one sequence.

Decode the `input_ids` to return your original input:

```py
>>> tokenizer.decode(encoded_input["input_ids"])
'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'
```

As you can see, the tokenizer added two special tokens - `CLS` and `SEP` (classifier and separator) - to the sentence. Not all models need
special tokens, but if they do, the tokenizer automatically adds them for you.

If there are several sentences you want to preprocess, pass them as a list to the tokenizer:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_inputs = tokenizer(batch_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1]]}
```

### Pad

Sentences aren't always the same length, which can be an issue because tensors, the model inputs, need to have a uniform shape. Padding is a strategy for ensuring tensors are rectangular by adding a special *padding token* to shorter sentences.

Set the `padding` parameter to `True` to pad the shorter sequences in the batch to match the longest sequence:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
```

The first and third sentences are now padded with `0`'s because they are shorter.

### Truncation

On the other end of the spectrum, sometimes a sequence may be too long for a model to handle. In this case, you'll need to truncate the sequence to a shorter length.

Set the `truncation` parameter to `True` to truncate a sequence to the maximum length accepted by the model:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
```

<Tip>

Check out the [Padding and truncation](./pad_truncation) concept guide to learn more about different padding and truncation arguments.

</Tip>
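
For instance, here's a small sketch of two of those arguments (the `max_length` value is arbitrary): `padding="max_length"` pads every sequence to exactly `max_length`, and `truncation=True` cuts off anything longer.

```py
>>> encoded_input = tokenizer(
...     batch_sentences,
...     padding="max_length",  # pad every sequence up to max_length
...     truncation=True,       # truncate any sequence longer than max_length
...     max_length=20,
... )
>>> [len(ids) for ids in encoded_input["input_ids"]]
[20, 20, 20]
```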

### Build tensors

Finally, you want the tokenizer to return the actual tensors that get fed to the model.

Set the `return_tensors` parameter to either `pt` for PyTorch, or `tf` for TensorFlow:

<frameworkcontent>
<pt>

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
>>> print(encoded_input)
{'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
                      [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
                      [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
```
</pt>
<tf>
```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
>>> print(encoded_input)
{'input_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
       [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
       [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int32)>,
 'token_type_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>,
 'attention_mask': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}
```
</tf>
</frameworkcontent>

## Audio

For audio tasks, you'll need a [feature extractor](main_classes/feature_extractor) to prepare your dataset for the model. The feature extractor is designed to extract features from raw audio data, and convert them into tensors.

Load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub) for more details on how to load a dataset) to see how you can use a feature extractor with audio datasets:

```py
>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
```

Access the first element of the `audio` column to take a look at the input. Calling the `audio` column automatically loads and resamples the audio file:

```py
>>> dataset[0]["audio"]
{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
         0.        ,  0.        ], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 8000}
```

This returns three items:

* `array` is the speech signal loaded - and potentially resampled - as a 1D array.
* `path` points to the location of the audio file.
* `sampling_rate` refers to how many data points in the speech signal are measured per second.

For this tutorial, you'll use the [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) model. Take a look at the model card, and you'll learn Wav2Vec2 is pretrained on 16kHz sampled speech audio. It is important that your audio data's sampling rate matches the sampling rate of the dataset used to pretrain the model. If your data's sampling rate isn't the same, then you need to resample your data.

1. Use 馃 Datasets' [`~datasets.Dataset.cast_column`] method to upsample the sampling rate to 16kHz:

```py
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
```

2. Call the `audio` column again to resample the audio file:

```py
>>> dataset[0]["audio"]
{'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,
         3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 16000}
```

Next, load a feature extractor to normalize and pad the input. When padding textual data, a `0` is added for shorter sequences. The same idea applies to audio data. The feature extractor adds a `0` - interpreted as silence - to `array`.

Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:

```py
>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
```

Pass the audio `array` to the feature extractor. We also recommend adding the `sampling_rate` argument in the feature extractor in order to better debug any silent errors that may occur.

```py
>>> audio_input = [dataset[0]["audio"]["array"]]
>>> feature_extractor(audio_input, sampling_rate=16000)
{'input_values': [array([ 3.8106556e-04,  2.7506407e-03,  2.8015103e-03, ...,
        5.6335266e-04,  4.6588284e-06, -1.7142107e-04], dtype=float32)]}
```

Just like the tokenizer, you can apply padding or truncation to handle variable-length sequences in a batch. Take a look at the sequence length of these two audio samples:

```py
>>> dataset[0]["audio"]["array"].shape
(173398,)

>>> dataset[1]["audio"]["array"].shape
(106496,)
```

Create a function to preprocess the dataset so the audio samples are the same length. Specify a maximum sample length, and the feature extractor will either pad or truncate the sequences to match it:

```py
>>> def preprocess_function(examples):
...     audio_arrays = [x["array"] for x in examples["audio"]]
...     inputs = feature_extractor(
...         audio_arrays,
...         sampling_rate=16000,
...         padding=True,
...         max_length=100000,
...         truncation=True,
...     )
...     return inputs
```

Apply the `preprocess_function` to the first few examples in the dataset:

```py
>>> processed_dataset = preprocess_function(dataset[:5])
```

The sample lengths are now the same and match the specified maximum length. You can pass your processed dataset to the model now!

```py
>>> processed_dataset["input_values"][0].shape
(100000,)

>>> processed_dataset["input_values"][1].shape
(100000,)
```
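
To preprocess the entire dataset rather than a slice, you could map the function over it with 🤗 Datasets (a minimal sketch; `batched=True` passes the function batches of examples like the slice above):

```py
>>> dataset = dataset.map(preprocess_function, batched=True)
```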

## Computer vision

For computer vision tasks, you'll need an [image processor](main_classes/image_processor) to prepare your dataset for the model.
Image preprocessing consists of several steps that convert images into the input expected by the model. These steps
include but are not limited to resizing, normalizing, color channel correction, and converting images to tensors.

<Tip>

Image preprocessing often follows some form of image augmentation. Both image preprocessing and image augmentation
transform image data, but they serve different purposes:

* Image augmentation alters images in a way that can help prevent overfitting and increase the robustness of the model. You can get creative in how you augment your data - adjust brightness and colors, crop, rotate, resize, zoom, etc. However, be mindful not to change the meaning of the images with your augmentations.
* Image preprocessing guarantees that the images match the model's expected input format. When fine-tuning a computer vision model, images must be preprocessed exactly as when the model was initially trained.

You can use any library you like for image augmentation. For image preprocessing, use the `ImageProcessor` associated with the model.

</Tip>

Load the [food101](https://huggingface.co/datasets/food101) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub) for more details on how to load a dataset) to see how you can use an image processor with computer vision datasets:

<Tip>

Use 馃 Datasets `split` parameter to only load a small sample from the training split since the dataset is quite large!

</Tip>

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("food101", split="train[:100]")
```

Next, take a look at the image with the 🤗 Datasets [`Image`](https://huggingface.co/docs/datasets/package_reference/main_classes?highlight=image#datasets.Image) feature:

```py
>>> dataset[0]["image"]
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vision-preprocess-tutorial.png"/>
</div>

Load the image processor with [`AutoImageProcessor.from_pretrained`]:

```py
>>> from transformers import AutoImageProcessor

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
```

First, let's add some image augmentation. You can use any library you prefer, but in this tutorial, we'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module. If you're interested in using another data augmentation library, learn how in the [Albumentations](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb) or [Kornia notebooks](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb).

1. Here we use [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) to chain together a couple of
transforms - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html).
Note that for resizing, we can get the image size requirements from the `image_processor`. For some models, an exact height and
width are expected, for others only the `shortest_edge` is defined.

```py
>>> from torchvision.transforms import RandomResizedCrop, ColorJitter, Compose

>>> size = (
...     image_processor.size["shortest_edge"]
...     if "shortest_edge" in image_processor.size
...     else (image_processor.size["height"], image_processor.size["width"])
... )

>>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5)])
```

2. The model accepts [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values)
as its input. `ImageProcessor` can take care of normalizing the images, and generating appropriate tensors.
Create a function that combines image augmentation and image preprocessing for a batch of images and generates `pixel_values`:

```py
>>> def transforms(examples):
...     images = [_transforms(img.convert("RGB")) for img in examples["image"]]
...     examples["pixel_values"] = image_processor(images, do_resize=False, return_tensors="pt")["pixel_values"]
...     return examples
```

<Tip>

In the example above we set `do_resize=False` because we have already resized the images in the image augmentation transformation,
and leveraged the `size` attribute from the appropriate `image_processor`. If you do not resize images during image augmentation,
leave this parameter out. By default, `ImageProcessor` will handle the resizing.

If you wish to normalize images as a part of the augmentation transformation, use the `image_processor.image_mean`,
and `image_processor.image_std` values.
</Tip>
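
For example, here's a sketch of what that variant of the augmentation pipeline above could look like (hypothetical; it assumes you would then also pass `do_normalize=False`, and likely `do_rescale=False` since `ToTensor` already scales pixel values to `[0, 1]`, when calling the image processor):

```py
>>> from torchvision.transforms import ColorJitter, Compose, Normalize, RandomResizedCrop, ToTensor

>>> # Normalize during augmentation with the statistics stored on the image processor;
>>> # ToTensor converts the PIL image to a tensor first, which Normalize expects
>>> _transforms = Compose(
...     [
...         RandomResizedCrop(size),
...         ColorJitter(brightness=0.5, hue=0.5),
...         ToTensor(),
...         Normalize(mean=image_processor.image_mean, std=image_processor.image_std),
...     ]
... )
```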

3. Then use 🤗 Datasets [`~datasets.Dataset.set_transform`] to apply the transforms on the fly:
```py
>>> dataset.set_transform(transforms)
```

4. Now when you access the image, you'll notice the image processor has added `pixel_values`. You can pass your processed dataset to the model now!

```py
>>> dataset[0].keys()
```

Here is what the image looks like after the transforms are applied. The image has been randomly cropped and its color properties are different.

```py
>>> import numpy as np
>>> import matplotlib.pyplot as plt

>>> img = dataset[0]["pixel_values"]
>>> plt.imshow(img.permute(1, 2, 0))
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/preprocessed_image.png"/>
</div>

<Tip>

For tasks like object detection, semantic segmentation, instance segmentation, and panoptic segmentation, `ImageProcessor`
offers post-processing methods. These methods convert the model's raw outputs into meaningful predictions such as bounding boxes
or segmentation maps.

</Tip>
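
As a rough sketch (assuming `outputs` holds raw predictions from a DETR-style detection model, `image_processor` is its matching image processor, and `target_sizes` lists each original image's `(height, width)`):

```py
>>> # Convert raw detection outputs into thresholded boxes, labels, and scores per image
>>> results = image_processor.post_process_object_detection(
...     outputs, threshold=0.5, target_sizes=target_sizes
... )
```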

### Pad

In some cases, for instance, when fine-tuning [DETR](./model_doc/detr), the model applies scale augmentation at training
time. This may cause images to be different sizes in a batch. You can use [`DetrImageProcessor.pad`]
from [`DetrImageProcessor`] and define a custom `collate_fn` to batch images together.

```py
>>> def collate_fn(batch):
...     pixel_values = [item["pixel_values"] for item in batch]
...     encoding = image_processor.pad(pixel_values, return_tensors="pt")
...     labels = [item["labels"] for item in batch]
...     batch = {}
...     batch["pixel_values"] = encoding["pixel_values"]
...     batch["pixel_mask"] = encoding["pixel_mask"]
...     batch["labels"] = labels
...     return batch
```
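
You could then hand this `collate_fn` to a PyTorch `DataLoader` (a minimal sketch, assuming `dataset` yields items containing `pixel_values` and `labels`):

```py
>>> from torch.utils.data import DataLoader

>>> dataloader = DataLoader(dataset, collate_fn=collate_fn, batch_size=4)
```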

## Multimodal

For tasks involving multimodal inputs, you'll need a [processor](main_classes/processors) to prepare your dataset for the model. A processor couples together two processing objects such as a tokenizer and a feature extractor.

Load the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub) for more details on how to load a dataset) to see how you can use a processor for automatic speech recognition (ASR):

```py
>>> from datasets import load_dataset

>>> lj_speech = load_dataset("lj_speech", split="train")
```

For ASR, you're mainly focused on `audio` and `text` so you can remove the other columns:

```py
>>> lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"])
```

Now take a look at the `audio` and `text` columns:

```py
>>> lj_speech[0]["audio"]
{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ...,
         7.3242188e-04,  2.1362305e-04,  6.1035156e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
 'sampling_rate': 22050}

>>> lj_speech[0]["text"]
'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition'
```

Remember you should always [resample](preprocessing#audio) your audio dataset's sampling rate to match the sampling rate of the dataset used to pretrain a model!

```py
>>> from datasets import Audio

>>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000))
```

Load a processor with [`AutoProcessor.from_pretrained`]:

```py
>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
```

1. Create a function to process the audio data contained in `array` to `input_values`, and tokenize `text` to `labels`. These are the inputs to the model:

```py
>>> def prepare_dataset(example):
...     audio = example["audio"]

...     example.update(processor(audio=audio["array"], text=example["text"], sampling_rate=16000))

...     return example
```

2. Apply the `prepare_dataset` function to a sample:

```py
>>> prepare_dataset(lj_speech[0])
```

The processor has now added `input_values` and `labels`, and the sampling rate has also been correctly downsampled to 16kHz. You can pass your processed dataset to the model now!
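
To preprocess every example instead of just one, you could map the function over the dataset (a minimal sketch; dropping the original columns afterwards is optional):

```py
>>> lj_speech = lj_speech.map(prepare_dataset, remove_columns=["audio", "text"])
```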