"tests/models/swinv2/test_modeling_swinv2.py" did not exist on "3772af49ceba348f2c9c5bbbb7f7c12e35d2a6eb"
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Preprocess

[[open-in-colab]]

Before you can train a model on a dataset, it needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model. In this tutorial, you'll learn that for:

* Text, use a [Tokenizer](./main_classes/tokenizer) to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
* Image inputs, use an [ImageProcessor](./main_classes/image) to convert images into tensors.
* Speech and audio, use a [Feature extractor](./main_classes/feature_extractor) to extract sequential features from audio waveforms and convert them into tensors.
* Multimodal inputs, use a [Processor](./main_classes/processors) to combine a tokenizer and a feature extractor.

<Tip>

`AutoProcessor` **always** works and automatically chooses the correct class for the model you're using, whether you're using a tokenizer, feature extractor or processor.

</Tip>
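
For example, loading a speech checkpoint that ships both a tokenizer and a feature extractor returns its combined processor class - a minimal sketch using the `facebook/wav2vec2-base-960h` checkpoint that also appears later in this guide:

```py
>>> from transformers import AutoProcessor

>>> # AutoProcessor resolves to the preprocessing class the checkpoint was saved with
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
>>> type(processor).__name__
'Wav2Vec2Processor'
```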

Before you begin, install 🤗 Datasets so you can load some datasets to experiment with:

```bash
pip install datasets
```

## Natural Language Processing

<Youtube id="Yffk5aydLzg"/>

The main tool for preprocessing textual data is a [tokenizer](main_classes/tokenizer). A tokenizer splits text into *tokens* according to a set of rules. The tokens are converted into numbers and then tensors, which become the model inputs. Any additional inputs required by the model are added by the tokenizer.

<Tip>

If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer. This ensures the text is split the same way as the pretraining corpus, and uses the same corresponding tokens-to-index mapping (usually referred to as the *vocab*) as during pretraining.

</Tip>

Get started by loading a pretrained tokenizer with the [`AutoTokenizer.from_pretrained`] method. This downloads the *vocab* a model was pretrained with:

```py
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```

Then pass your text to the tokenizer:

```py
>>> encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
>>> print(encoded_input)
{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102], 
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```

The tokenizer returns a dictionary with three important items:

* [input_ids](glossary#input-ids) are the indices corresponding to each token in the sentence.
* [attention_mask](glossary#attention-mask) indicates whether a token should be attended to or not.
* [token_type_ids](glossary#token-type-ids) identifies which sequence a token belongs to when there is more than one sequence.
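
For instance, `token_type_ids` only carries information when you encode a sentence pair - a quick sketch reusing the same tokenizer (the sentence pair here is just an illustrative example):

```py
>>> pair_input = tokenizer("But what about second breakfast?", "What about elevensies?")
>>> print(pair_input["token_type_ids"])  # 0 for tokens from the first sentence, 1 for the second
```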

Return your input by decoding the `input_ids`:

```py
>>> tokenizer.decode(encoded_input["input_ids"])
'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'
```

As you can see, the tokenizer added two special tokens - `CLS` and `SEP` (classifier and separator) - to the sentence. Not all models need special tokens, but if they do, the tokenizer automatically adds them for you.
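
If you don't want the special tokens in the decoded text, you can skip them - a small sketch:

```py
>>> tokenizer.decode(encoded_input["input_ids"], skip_special_tokens=True)
'Do not meddle in the affairs of wizards, for they are subtle and quick to anger.'
```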

If there are several sentences you want to preprocess, pass them as a list to the tokenizer:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_inputs = tokenizer(batch_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1]]}
```

### Pad

Sentences aren't always the same length, which can be an issue because tensors, the model inputs, need to have a uniform shape. Padding is a strategy for ensuring tensors are rectangular by adding a special *padding token* to shorter sentences.

Set the `padding` parameter to `True` to pad the shorter sequences in the batch to match the longest sequence:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
```

The first and third sentences are now padded with `0`'s because they are shorter.

### Truncation

On the other end of the spectrum, sometimes a sequence may be too long for a model to handle. In this case, you'll need to truncate the sequence to a shorter length.

Set the `truncation` parameter to `True` to truncate a sequence to the maximum length accepted by the model:

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
```

<Tip>

Check out the [Padding and truncation](./pad_truncation) concept guide to learn more about the different padding and truncation arguments.

</Tip>
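
For example, combining `padding="max_length"` with `max_length` pads and truncates every sequence to one fixed length - a quick sketch with an arbitrary length of 10:

```py
>>> encoded_input = tokenizer(batch_sentences, padding="max_length", truncation=True, max_length=10)
>>> [len(ids) for ids in encoded_input["input_ids"]]
[10, 10, 10]
```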

### Build tensors

Finally, you want the tokenizer to return the actual tensors that get fed to the model.

Set the `return_tensors` parameter to either `pt` for PyTorch, or `tf` for TensorFlow:

<frameworkcontent>
<pt>

```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
>>> print(encoded_input)
{'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
                      [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
                      [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]), 
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
```
</pt>
<tf>
```py
>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
>>> print(encoded_input)
{'input_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
       [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
       [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int32)>, 
 'token_type_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>, 
 'attention_mask': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}
```
</tf>
</frameworkcontent>

## Audio

For audio tasks, you'll need a [feature extractor](main_classes/feature_extractor) to prepare your dataset for the model. The feature extractor is designed to extract features from raw audio data, and convert them into tensors.

Load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a feature extractor with audio datasets:

```py
>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
```

Access the first element of the `audio` column to take a look at the input. Calling the `audio` column automatically loads and resamples the audio file:

```py
>>> dataset[0]["audio"]
{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
         0.        ,  0.        ], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 8000}
```

This returns three items:

* `array` is the speech signal loaded - and potentially resampled - as a 1D array.
* `path` points to the location of the audio file.
* `sampling_rate` refers to how many data points in the speech signal are measured per second.

For this tutorial, you'll use the [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) model. Take a look at the model card, and you'll learn Wav2Vec2 is pretrained on 16kHz sampled speech audio. It is important your audio data's sampling rate matches the sampling rate of the dataset used to pretrain the model. If your data's sampling rate isn't the same, then you need to resample your data. 

1. Use 馃 Datasets' [`~datasets.Dataset.cast_column`] method to upsample the sampling rate to 16kHz:

```py
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
```

2. Call the `audio` column again to resample the audio file:

```py
>>> dataset[0]["audio"]
{'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,
         3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 16000}
```

Next, load a feature extractor to normalize and pad the input. When padding textual data, a `0` is added for shorter sequences. The same idea applies to audio data. The feature extractor adds a `0` - interpreted as silence - to `array`.

Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:

```py
>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
```

Pass the audio `array` to the feature extractor. We also recommend adding the `sampling_rate` argument in the feature extractor in order to better debug any silent errors that may occur.

```py
>>> audio_input = [dataset[0]["audio"]["array"]]
>>> feature_extractor(audio_input, sampling_rate=16000)
{'input_values': [array([ 3.8106556e-04,  2.7506407e-03,  2.8015103e-03, ...,
        5.6335266e-04,  4.6588284e-06, -1.7142107e-04], dtype=float32)]}
```

Just like the tokenizer, you can apply padding or truncation to handle variable-length sequences in a batch. Take a look at the sequence length of these two audio samples:

```py
>>> dataset[0]["audio"]["array"].shape
(173398,)

>>> dataset[1]["audio"]["array"].shape
(106496,)
```

Create a function to preprocess the dataset so the audio samples are the same length. Specify a maximum sample length, and the feature extractor will either pad or truncate the sequences to match it:

```py
>>> def preprocess_function(examples):
...     audio_arrays = [x["array"] for x in examples["audio"]]
...     inputs = feature_extractor(
...         audio_arrays,
...         sampling_rate=16000,
...         padding=True,
...         max_length=100000,
...         truncation=True,
...     )
...     return inputs
```

Apply the `preprocess_function` to the first few examples in the dataset:

```py
>>> processed_dataset = preprocess_function(dataset[:5])
```

The sample lengths are now the same and match the specified maximum length. You can pass your processed dataset to the model now!

```py
>>> processed_dataset["input_values"][0].shape
(100000,)

>>> processed_dataset["input_values"][1].shape
(100000,)
```
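
To preprocess an entire dataset rather than a slice, you would typically apply the same function with 🤗 Datasets [`~datasets.Dataset.map`] - a minimal sketch (`batched=True` feeds the function several examples at once):

```py
>>> dataset = dataset.map(preprocess_function, batched=True)
```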

## Computer vision

For computer vision tasks, you'll need a [feature extractor](main_classes/feature_extractor) to prepare your dataset for the model. The feature extractor is designed to extract features from images, and convert them into tensors.

Load the [food101](https://huggingface.co/datasets/food101) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a feature extractor with computer vision datasets:

<Tip>

Use 馃 Datasets `split` parameter to only load a small sample from the training split since the dataset is quite large!

</Tip>

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("food101", split="train[:100]")
```

Next, take a look at the image with 🤗 Datasets [`Image`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=image#datasets.Image) feature:

```py
>>> dataset[0]["image"]
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vision-preprocess-tutorial.png"/>
</div>

Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:

```py
>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
```

For computer vision tasks, it is common to add some type of data augmentation to the images as a part of preprocessing. You can add augmentations with any library you'd like, but in this tutorial, you'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module. If you're interested in using another data augmentation library, learn how in the [Albumentations](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb) or [Kornia notebooks](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb).

1. Normalize the image with the feature extractor and use [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) to chain some transforms - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html) - together:

```py
>>> from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor

>>> normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
>>> size = (
...     feature_extractor.size["shortest_edge"]
...     if "shortest_edge" in feature_extractor.size
...     else (feature_extractor.size["height"], feature_extractor.size["width"])
... )
>>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize])
```

2. The model accepts [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values) as its input, which is generated by the feature extractor. Create a function that generates `pixel_values` from the transforms:

```py
>>> def transforms(examples):
...     examples["pixel_values"] = [_transforms(image.convert("RGB")) for image in examples["image"]]
...     return examples
```

3. Then use 馃 Datasets [`set_transform`](https://huggingface.co/docs/datasets/process.html#format-transform) to apply the transforms on the fly:

```py
>>> dataset.set_transform(transforms)
```

4. Now when you access an example from the dataset, you'll notice the feature extractor has added `pixel_values`. You can pass your processed dataset to the model now!

```py
>>> dataset[0]
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F1A7B0630D0>,
 'label': 6,
 'pixel_values': tensor([[[ 0.0353,  0.0745,  0.1216,  ..., -0.9922, -0.9922, -0.9922],
          [-0.0196,  0.0667,  0.1294,  ..., -0.9765, -0.9843, -0.9922],
          [ 0.0196,  0.0824,  0.1137,  ..., -0.9765, -0.9686, -0.8667],
          ...,
          [ 0.0275,  0.0745,  0.0510,  ..., -0.1137, -0.1216, -0.0824],
          [ 0.0667,  0.0824,  0.0667,  ..., -0.0588, -0.0745, -0.0980],
          [ 0.0353,  0.0353,  0.0431,  ..., -0.0039, -0.0039, -0.0588]],
 
         [[ 0.2078,  0.2471,  0.2863,  ..., -0.9451, -0.9373, -0.9451],
          [ 0.1608,  0.2471,  0.3098,  ..., -0.9373, -0.9451, -0.9373],
          [ 0.2078,  0.2706,  0.3020,  ..., -0.9608, -0.9373, -0.8275],
          ...,
          [-0.0353,  0.0118, -0.0039,  ..., -0.2392, -0.2471, -0.2078],
          [ 0.0196,  0.0353,  0.0196,  ..., -0.1843, -0.2000, -0.2235],
          [-0.0118, -0.0039, -0.0039,  ..., -0.0980, -0.0980, -0.1529]],
 
         [[ 0.3961,  0.4431,  0.4980,  ..., -0.9216, -0.9137, -0.9216],
          [ 0.3569,  0.4510,  0.5216,  ..., -0.9059, -0.9137, -0.9137],
          [ 0.4118,  0.4745,  0.5216,  ..., -0.9137, -0.8902, -0.7804],
          ...,
          [-0.2314, -0.1922, -0.2078,  ..., -0.4196, -0.4275, -0.3882],
          [-0.1843, -0.1686, -0.2000,  ..., -0.3647, -0.3804, -0.4039],
          [-0.1922, -0.1922, -0.1922,  ..., -0.2941, -0.2863, -0.3412]]])}
```

Here is what the image looks like after the transforms are applied. The image has been randomly cropped and its color properties are different.

```py
>>> import numpy as np
>>> import matplotlib.pyplot as plt

>>> img = dataset[0]["pixel_values"]
>>> plt.imshow(img.permute(1, 2, 0))
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/preprocessed_image.png"/>
</div>

## Multimodal

For tasks involving multimodal inputs, you'll need a [processor](main_classes/processors) to prepare your dataset for the model. A processor couples a tokenizer and feature extractor.

Load the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a processor for automatic speech recognition (ASR):

```py
>>> from datasets import load_dataset

>>> lj_speech = load_dataset("lj_speech", split="train")
```

For ASR, you're mainly focused on `audio` and `text` so you can remove the other columns:

```py
>>> lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"])
```

Now take a look at the `audio` and `text` columns:

```py
>>> lj_speech[0]["audio"]
{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ...,
         7.3242188e-04,  2.1362305e-04,  6.1035156e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
 'sampling_rate': 22050}

>>> lj_speech[0]["text"]
'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition'
```

Remember you should always [resample](preprocessing#audio) your audio dataset's sampling rate to match the sampling rate of the dataset used to pretrain a model!

```py
>>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000))
```

Load a processor with [`AutoProcessor.from_pretrained`]:

```py
>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
```

1. Create a function to process the audio data contained in `array` to `input_values`, and tokenize `text` to `labels`. These are the inputs to the model:

```py
>>> def prepare_dataset(example):
...     audio = example["audio"]

...     example.update(processor(audio=audio["array"], text=example["text"], sampling_rate=16000))

...     return example
```

2. Apply the `prepare_dataset` function to a sample:

```py
>>> prepare_dataset(lj_speech[0])
```

The processor has now added `input_values` and `labels`, and the sampling rate has also been correctly downsampled to 16kHz. You can pass your processed dataset to the model now!
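
To run the function over the full dataset, you could use 🤗 Datasets [`~datasets.Dataset.map`] in the same way - a sketch where dropping the raw `audio` and `text` columns is an illustrative choice:

```py
>>> lj_speech = lj_speech.map(prepare_dataset, remove_columns=["audio", "text"])
```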