<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Image classification

[[open-in-colab]]

<Youtube id="tjAIM7BOYhw"/>

Image classification assigns a label or class to an image. Unlike text or audio classification, the inputs are the
pixel values that comprise an image. There are many applications for image classification, such as detecting damage
after a natural disaster, monitoring crop health, or helping screen medical images for signs of disease.

This guide illustrates how to:

1. Fine-tune [ViT](model_doc/vit) on the [Food-101](https://huggingface.co/datasets/food101) dataset to classify a food item in an image.
2. Use your fine-tuned model for inference.

<Tip>
The task illustrated in this tutorial is supported by the following model architectures:

<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

[BEiT](../model_doc/beit), [BiT](../model_doc/bit), [ConvNeXT](../model_doc/convnext), [ConvNeXTV2](../model_doc/convnextv2), [CvT](../model_doc/cvt), [Data2VecVision](../model_doc/data2vec-vision), [DeiT](../model_doc/deit), [DiNAT](../model_doc/dinat), [EfficientFormer](../model_doc/efficientformer), [EfficientNet](../model_doc/efficientnet), [FocalNet](../model_doc/focalnet), [ImageGPT](../model_doc/imagegpt), [LeViT](../model_doc/levit), [MobileNetV1](../model_doc/mobilenet_v1), [MobileNetV2](../model_doc/mobilenet_v2), [MobileViT](../model_doc/mobilevit), [NAT](../model_doc/nat), [Perceiver](../model_doc/perceiver), [PoolFormer](../model_doc/poolformer), [RegNet](../model_doc/regnet), [ResNet](../model_doc/resnet), [SegFormer](../model_doc/segformer), [Swin Transformer](../model_doc/swin), [Swin Transformer V2](../model_doc/swinv2), [VAN](../model_doc/van), [ViT](../model_doc/vit), [ViT Hybrid](../model_doc/vit_hybrid), [ViTMSN](../model_doc/vit_msn)
<!--End of the generated tip-->

</Tip>

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install transformers datasets evaluate
```

We encourage you to log in to your Hugging Face account to upload and share your model with the community. When prompted, enter your token to log in:

```py
>>> from huggingface_hub import notebook_login

>>> notebook_login()
```

## Load Food-101 dataset

Start by loading a smaller subset of the Food-101 dataset from the 🤗 Datasets library. This will give you a chance to
experiment and make sure everything works before spending more time training on the full dataset.

```py
>>> from datasets import load_dataset

>>> food = load_dataset("food101", split="train[:5000]")
```

Split the dataset's `train` split into a train and test set with the [`~datasets.Dataset.train_test_split`] method:

```py
>>> food = food.train_test_split(test_size=0.2)
```
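
To make the effect of `test_size=0.2` concrete, here is a minimal pure-Python sketch of a shuffled 80/20 split over example indices. The helper `split_indices` and its `seed` parameter are hypothetical and only illustrate the idea; they are not how the Datasets library implements the method:

```py
import random


def split_indices(num_examples, test_size=0.2, seed=0):
    """Shuffle indices and split them into (train, test) lists.

    Hypothetical helper for illustration only.
    """
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    num_test = round(num_examples * test_size)
    return indices[num_test:], indices[:num_test]


# The 5,000-image subset loaded above yields 4,000 train and 1,000 test examples.
train_idx, test_idx = split_indices(5000, test_size=0.2)
print(len(train_idx), len(test_idx))  # 4000 1000
```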

Then take a look at an example:

```py
>>> food["train"][0]
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F52AFC8AC50>,
 'label': 79}
```

Each example in the dataset has two fields:

- `image`: a PIL image of the food item
- `label`: the label class of the food item

To make it easier for the model to get the label name from the label id, create a dictionary that maps the label name
to an integer and vice versa:

```py
>>> labels = food["train"].features["label"].names
>>> label2id, id2label = dict(), dict()
>>> for i, label in enumerate(labels):
...     label2id[label] = str(i)
...     id2label[str(i)] = label
```

Now you can convert the label id to a label name:

```py
>>> id2label[str(79)]
'prime_rib'
```
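
If you want to check the mapping logic without downloading the dataset, the following self-contained sketch builds the same two dictionaries from a short, hypothetical label list (Food-101 itself has 101 class names). Note that the ids are stored as strings, which is why the lookup goes through `str(...)`:

```py
# Hypothetical three-class label list, used only for illustration.
labels = ["apple_pie", "baby_back_ribs", "baklava"]

label2id = {label: str(i) for i, label in enumerate(labels)}
id2label = {str(i): label for i, label in enumerate(labels)}

# Keys of id2label are strings here, so an integer id must be converted first.
print(id2label[str(1)])  # baby_back_ribs
print(label2id["baklava"])  # 2
```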

## Preprocess

The next step is to load a ViT image processor to process the image into a tensor:

```py
>>> from transformers import AutoImageProcessor

>>> checkpoint = "google/vit-base-patch16-224-in21k"
>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)
```

<frameworkcontent>
<pt>
Apply some image transformations to the images to make the model more robust against overfitting. Here you'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module, but you can also use any image library you like.

Crop a random part of the image, resize it, and normalize it with the image mean and standard deviation:

```py
>>> from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor

>>> normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
>>> size = (
...     image_processor.size["shortest_edge"]
...     if "shortest_edge" in image_processor.size
...     else (image_processor.size["height"], image_processor.size["width"])
... )
>>> _transforms = Compose([RandomResizedCrop(size), ToTensor(), normalize])
```
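
The conditional that computes `size` handles two layouts of the image processor's `size` dictionary. As a rough illustration, the same logic can be written as a small standalone function; `resolve_crop_size` and the example dictionaries below are hypothetical and not part of 🤗 Transformers:

```py
def resolve_crop_size(size):
    """Return an int for a `shortest_edge` config, else a (height, width) tuple."""
    if "shortest_edge" in size:
        return size["shortest_edge"]
    return (size["height"], size["width"])


# Hypothetical size dictionaries mimicking the two layouts handled above.
print(resolve_crop_size({"shortest_edge": 256}))  # 256
print(resolve_crop_size({"height": 224, "width": 224}))  # (224, 224)
```

Either result is a valid argument for `RandomResizedCrop`, which accepts a single integer (square output) or a `(height, width)` pair.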

Then create a preprocessing function to apply the transforms and return the `pixel_values` - the inputs to the model - of the image:

```py
>>> def transforms(examples):
...     examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
...     del examples["image"]
...     return examples
```

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [`~datasets.Dataset.with_transform`] method. The transforms are applied on the fly when you load an element of the dataset:

```py
>>> food = food.with_transform(transforms)
```
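
To see what "on the fly" means in practice, consider the toy sketch below. `LazyTransformDataset` and `double_pixels` are hypothetical stand-ins written for this illustration, not the 🤗 Datasets implementation; the point is only that the transform runs when an element is accessed, and that the stored rows are left untouched:

```py
class LazyTransformDataset:
    """Hypothetical stand-in for a dataset that applies a transform on access."""

    def __init__(self, rows, transform):
        self.rows = rows
        self.transform = transform
        self.calls = 0  # counts how often the transform actually runs

    def __getitem__(self, index):
        self.calls += 1
        # Wrap the single row as a batch of size one, as batched transforms expect.
        batch = {key: [value] for key, value in self.rows[index].items()}
        out = self.transform(batch)
        return {key: values[0] for key, values in out.items()}


def double_pixels(examples):
    # Toy transform: derive `pixel_values` from `image`, then drop `image`.
    examples["pixel_values"] = [[2 * p for p in img] for img in examples["image"]]
    del examples["image"]
    return examples


rows = [{"image": [1, 2, 3], "label": 0}, {"image": [4, 5, 6], "label": 1}]
dataset = LazyTransformDataset(rows, double_pixels)
print(dataset.calls)  # 0, nothing has been transformed yet
item = dataset[1]
print(item)  # {'label': 1, 'pixel_values': [8, 10, 12]}
print(dataset.calls)  # 1
```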

Now create a batch of examples using [`DefaultDataCollator`]. Unlike other data collators in 🤗 Transformers, the `DefaultDataCollator` does not apply additional preprocessing such as padding.

```py
>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator()
```
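
Conceptually, collation turns a list of per-example dictionaries into a single dictionary of batched values. The sketch below is a simplified, framework-free illustration: `simple_collate` is a hypothetical helper that keeps plain lists instead of stacking tensors, and its renaming of `label` to `labels` is meant to mirror what the default collator is documented to do:

```py
def simple_collate(features):
    """Group a list of per-example dicts into one dict of lists.

    Simplified illustration: it renames `label` to `labels` and does no padding.
    """
    batch = {}
    for key in features[0]:
        out_key = "labels" if key == "label" else key
        batch[out_key] = [example[key] for example in features]
    return batch


# Hypothetical two-example batch with tiny "images".
features = [
    {"pixel_values": [0.1, 0.2], "label": 3},
    {"pixel_values": [0.3, 0.4], "label": 7},
]
batch = simple_collate(features)
print(batch)
```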
</pt>
</frameworkcontent>


<frameworkcontent>
<tf>

To avoid overfitting and to make the model more robust, add some data augmentation to the training part of the dataset.
Here we use Keras preprocessing layers to define the transformations for the training data (including data augmentation),
and the transformations for the validation data (only center cropping and rescaling). You can use `tf.image` or
any other library you prefer.

```py
>>> from tensorflow import keras
>>> from tensorflow.keras import layers

>>> size = (image_processor.size["height"], image_processor.size["width"])

>>> train_data_augmentation = keras.Sequential(
...     [
...         layers.RandomCrop(size[0], size[1]),
...         layers.Rescaling(scale=1.0 / 127.5, offset=-1),
...         layers.RandomFlip("horizontal"),
...         layers.RandomRotation(factor=0.02),
...         layers.RandomZoom(height_factor=0.2, width_factor=0.2),
...     ],
...     name="train_data_augmentation",
... )

>>> val_data_augmentation = keras.Sequential(
...     [
...         layers.CenterCrop(size[0], size[1]),
...         layers.Rescaling(scale=1.0 / 127.5, offset=-1),
...     ],
...     name="val_data_augmentation",
... )
```
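
The `Rescaling(scale=1.0 / 127.5, offset=-1)` layers map 8-bit pixel values from the range [0, 255] to roughly [-1, 1]. A minimal sketch of that arithmetic, using a hypothetical `rescale` helper in plain Python:

```py
def rescale(pixel, scale=1.0 / 127.5, offset=-1.0):
    """Apply the affine map `pixel * scale + offset` used by the Rescaling layers above."""
    return pixel * scale + offset


# Darkest, middle and brightest 8-bit values.
for value in (0, 127.5, 255):
    print(value, rescale(value))
```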

Next, create functions to apply appropriate transformations to a batch of images, instead of one image at a time.

```py
>>> import numpy as np
>>> import tensorflow as tf
>>> from PIL import Image


>>> def convert_to_tf_tensor(image: Image.Image):
...     np_image = np.array(image)
...     tf_image = tf.convert_to_tensor(np_image)
...     # `expand_dims()` is used to add a batch dimension since
...     # the TF augmentation layers operate on batched inputs.
...     return tf.expand_dims(tf_image, 0)


>>> def preprocess_train(example_batch):
...     """Apply train_transforms across a batch."""
...     images = [
...         train_data_augmentation(convert_to_tf_tensor(image.convert("RGB"))) for image in example_batch["image"]
...     ]
...     example_batch["pixel_values"] = [tf.transpose(tf.squeeze(image)) for image in images]
...     return example_batch


>>> def preprocess_val(example_batch):
...     """Apply val_transforms across a batch."""
...     images = [
...         val_data_augmentation(convert_to_tf_tensor(image.convert("RGB"))) for image in example_batch["image"]
...     ]
...     example_batch["pixel_values"] = [tf.transpose(tf.squeeze(image)) for image in images]
...     return example_batch
```

Use 🤗 Datasets [`~datasets.Dataset.set_transform`] to apply the transformations on the fly:

```py
>>> food["train"].set_transform(preprocess_train)
>>> food["test"].set_transform(preprocess_val)
```

As a final preprocessing step, create a batch of examples using `DefaultDataCollator`. Unlike other data collators in 🤗 Transformers, the
`DefaultDataCollator` does not apply additional preprocessing, such as padding.

```py
>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator(return_tensors="tf")
```
</tf>
</frameworkcontent>

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an
evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load
the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

```py
>>> import evaluate

>>> accuracy = evaluate.load("accuracy")
```

Then create a function that passes your predictions and labels to [`~evaluate.EvaluationModule.compute`] to calculate the accuracy:

```py
>>> import numpy as np


>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     predictions = np.argmax(predictions, axis=1)
...     return accuracy.compute(predictions=predictions, references=labels)
```
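
If it helps to see the computation without any libraries, the sketch below reproduces the same two steps (argmax over the class scores, then the fraction of matches) in plain Python. The helpers `argmax` and `accuracy_from_logits`, and the logits themselves, are hypothetical and only for illustration:

```py
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for four images and three classes.
logits = [[2.0, 0.1, -1.0], [0.2, 0.3, 0.1], [-0.5, 0.0, 1.5], [1.0, 0.9, 0.8]]
labels = [0, 1, 2, 1]
print(accuracy_from_logits(logits, labels))  # {'accuracy': 0.75}
```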

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.

## Train

<frameworkcontent>
<pt>
<Tip>

If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! Load ViT with [`AutoModelForImageClassification`]. Specify the number of expected labels along with the label mappings:

```py
>>> from transformers import AutoModelForImageClassification, TrainingArguments, Trainer

>>> model = AutoModelForImageClassification.from_pretrained(
...     checkpoint,
...     num_labels=len(labels),
...     id2label=id2label,
...     label2id=label2id,
... )
```

At this point, only three steps remain:

1. Define your training hyperparameters in [`TrainingArguments`]. It is important that you don't remove unused columns, because doing so drops the `image` column. Without the `image` column, you can't create `pixel_values`. Set `remove_unused_columns=False` to prevent this behavior! The only other required parameter is `output_dir`, which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [`Trainer`] will evaluate the accuracy and save the training checkpoint.
2. Pass the training arguments to [`Trainer`] along with the model, dataset, image processor (passed as the `tokenizer` argument), data collator, and `compute_metrics` function.
3. Call [`~Trainer.train`] to finetune your model.

```py
>>> training_args = TrainingArguments(
...     output_dir="my_awesome_food_model",
...     remove_unused_columns=False,
...     evaluation_strategy="epoch",
...     save_strategy="epoch",
...     learning_rate=5e-5,
...     per_device_train_batch_size=16,
...     gradient_accumulation_steps=4,
...     per_device_eval_batch_size=16,
...     num_train_epochs=3,
...     warmup_ratio=0.1,
...     logging_steps=10,
...     load_best_model_at_end=True,
...     metric_for_best_model="accuracy",
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     data_collator=data_collator,
...     train_dataset=food["train"],
...     eval_dataset=food["test"],
...     tokenizer=image_processor,
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()
```
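
To get a feel for what these hyperparameters imply, the back-of-the-envelope sketch below estimates the effective batch size and the number of optimizer updates. It assumes a single device and the 4,000 training examples from the subset used in this guide; the exact rounding the [`Trainer`] applies may differ between versions, so treat the numbers as approximate:

```py
import math

# Values taken from the TrainingArguments above; the 4,000 training examples
# assume the 5,000-image subset with a 20% test split and a single device.
num_examples = 4000
per_device_train_batch_size = 16
gradient_accumulation_steps = 4
num_train_epochs = 3
warmup_ratio = 0.1

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
# Approximation: one optimizer update per `gradient_accumulation_steps` batches.
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs
warmup_updates = math.ceil(total_updates * warmup_ratio)

print(effective_batch_size, updates_per_epoch, total_updates, warmup_updates)  # 64 62 186 19
```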

Once training is completed, share your model to the Hub with the [`~transformers.Trainer.push_to_hub`] method so everyone can use your model:

```py
>>> trainer.push_to_hub()
```
</pt>
</frameworkcontent>

<frameworkcontent>
<tf>

<Tip>

If you are unfamiliar with fine-tuning a model with Keras, check out the [basic tutorial](./training#train-a-tensorflow-model-with-keras) first!

</Tip>

To fine-tune a model in TensorFlow, follow these steps:

1. Define the training hyperparameters, and set up an optimizer and a learning rate schedule.
2. Instantiate a pre-trained model.
3. Convert a 🤗 Dataset to a `tf.data.Dataset`.
4. Compile your model.
5. Add callbacks and use the `fit()` method to run the training.
6. Upload your model to the 🤗 Hub to share with the community.

Start by defining the hyperparameters, optimizer and learning rate schedule:

```py
>>> from transformers import create_optimizer

>>> batch_size = 16
>>> num_epochs = 5
>>> num_train_steps = (len(food["train"]) // batch_size) * num_epochs
>>> learning_rate = 3e-5
>>> weight_decay_rate = 0.01

>>> optimizer, lr_schedule = create_optimizer(
...     init_lr=learning_rate,
...     num_train_steps=num_train_steps,
...     weight_decay_rate=weight_decay_rate,
...     num_warmup_steps=0,
... )
```
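
With `num_warmup_steps=0`, the learning rate simply decays from its initial value over the course of training. The sketch below assumes a linear decay to zero, which is our understanding of the default behavior; `linear_decay_lr` and the schedule length of 1,000 steps are hypothetical and only illustrate the shape of such a schedule:

```py
def linear_decay_lr(step, init_lr, total_steps):
    """Learning rate after `step` updates, decaying linearly from `init_lr` to 0."""
    progress = min(step, total_steps) / total_steps
    return init_lr * (1.0 - progress)


# Hypothetical schedule length; in the guide it comes from `num_train_steps`.
total_steps = 1000
for step in (0, 500, 1000):
    print(step, linear_decay_lr(step, init_lr=3e-5, total_steps=total_steps))
```

Because the decay is spread over `total_steps`, an overestimated step count means the learning rate barely decreases during the actual run.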

Then, load ViT with [`TFAutoModelForImageClassification`] along with the label mappings:

```py
>>> from transformers import TFAutoModelForImageClassification

>>> model = TFAutoModelForImageClassification.from_pretrained(
...     checkpoint,
...     id2label=id2label,
...     label2id=label2id,
... )
```

Convert your datasets to the `tf.data.Dataset` format using the [`~datasets.Dataset.to_tf_dataset`] method and your `data_collator`:

```py
>>> # converting our train dataset to tf.data.Dataset
>>> tf_train_dataset = food["train"].to_tf_dataset(
...     columns=["pixel_values"], label_cols=["label"], shuffle=True, batch_size=batch_size, collate_fn=data_collator
... )

>>> # converting our test dataset to tf.data.Dataset
>>> tf_eval_dataset = food["test"].to_tf_dataset(
...     columns=["pixel_values"], label_cols=["label"], shuffle=True, batch_size=batch_size, collate_fn=data_collator
... )
```

Configure the model for training with `compile()`:

```py
>>> from tensorflow.keras.losses import SparseCategoricalCrossentropy

>>> loss = SparseCategoricalCrossentropy(from_logits=True)
>>> model.compile(optimizer=optimizer, loss=loss)
```
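
The `from_logits=True` argument tells the loss that the model outputs raw scores rather than probabilities, so the softmax is applied inside the loss. As a rough, single-example illustration of that computation, here is a plain-Python sketch; the helper name mirrors the Keras class but is a hypothetical reimplementation, not the Keras code:

```py
import math


def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy of one example from raw logits and an integer class id."""
    # Subtract the max logit for numerical stability before exponentiating.
    shifted = [x - max(logits) for x in logits]
    log_sum_exp = math.log(sum(math.exp(x) for x in shifted))
    # Negative log-probability of the true class under the softmax.
    return log_sum_exp - shifted[label]


# Two equally likely classes give a loss of ln(2).
print(sparse_categorical_crossentropy([0.0, 0.0], 0))
# A confident, correct prediction gives a much smaller loss.
print(sparse_categorical_crossentropy([5.0, 0.0, 0.0], 0))
```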

To compute the accuracy from the predictions and push your model to the 🤗 Hub, use [Keras callbacks](../main_classes/keras_callbacks).
Pass your `compute_metrics` function to [KerasMetricCallback](../main_classes/keras_callbacks#transformers.KerasMetricCallback),
and use the [PushToHubCallback](../main_classes/keras_callbacks#transformers.PushToHubCallback) to upload the model:

```py
>>> from transformers.keras_callbacks import KerasMetricCallback, PushToHubCallback

>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_eval_dataset)
>>> push_to_hub_callback = PushToHubCallback(
...     output_dir="food_classifier",
...     tokenizer=image_processor,
...     save_strategy="no",
... )
>>> callbacks = [metric_callback, push_to_hub_callback]
```

Finally, you are ready to train your model! Call `fit()` with your training and validation datasets, the number of epochs,
and your callbacks to fine-tune the model:

```py
>>> model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=num_epochs, callbacks=callbacks)
Epoch 1/5
250/250 [==============================] - 313s 1s/step - loss: 2.5623 - val_loss: 1.4161 - accuracy: 0.9290
Epoch 2/5
250/250 [==============================] - 265s 1s/step - loss: 0.9181 - val_loss: 0.6808 - accuracy: 0.9690
Epoch 3/5
250/250 [==============================] - 252s 1s/step - loss: 0.3910 - val_loss: 0.4303 - accuracy: 0.9820
Epoch 4/5
250/250 [==============================] - 251s 1s/step - loss: 0.2028 - val_loss: 0.3191 - accuracy: 0.9900
Epoch 5/5
250/250 [==============================] - 238s 949ms/step - loss: 0.1232 - val_loss: 0.3259 - accuracy: 0.9890
```

Congratulations! You have fine-tuned your model and shared it on the 🤗 Hub. You can now use it for inference!
</tf>
</frameworkcontent>


<Tip>

For a more in-depth example of how to finetune a model for image classification, take a look at the corresponding [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).

</Tip>

## Inference

Great, now that you've fine-tuned a model, you can use it for inference!

Load an image you'd like to run inference on:

```py
>>> ds = load_dataset("food101", split="validation[:10]")
>>> image = ds["image"][0]
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png" alt="image of beignets"/>
</div>

The simplest way to try out your finetuned model for inference is to use it in a [`pipeline`]. Instantiate a `pipeline` for image classification with your model, and pass your image to it:

```py
>>> from transformers import pipeline

>>> classifier = pipeline("image-classification", model="my_awesome_food_model")
>>> classifier(image)
[{'score': 0.31856709718704224, 'label': 'beignets'},
 {'score': 0.015232225880026817, 'label': 'bruschetta'},
 {'score': 0.01519392803311348, 'label': 'chicken_wings'},
 {'score': 0.013022331520915031, 'label': 'pork_chop'},
 {'score': 0.012728818692266941, 'label': 'prime_rib'}]
```
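
The scored list above can be thought of as a softmax over the model's logits followed by a top-k selection. The sketch below illustrates that idea in plain Python, assuming softmax post-processing; `top_k_labels`, the logits, and the three-class `id2label` mapping are hypothetical and not the pipeline's actual code:

```py
import math


def top_k_labels(logits, id2label, k=5):
    """Softmax the logits and return the `k` highest-scoring labels."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    scored = [{"score": e / total, "label": id2label[i]} for i, e in enumerate(exps)]
    return sorted(scored, key=lambda item: item["score"], reverse=True)[:k]


# Hypothetical logits and label names for a three-class model.
id2label = {0: "beignets", 1: "bruschetta", 2: "prime_rib"}
for prediction in top_k_labels([2.0, 0.5, -1.0], id2label, k=2):
    print(prediction)
```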

You can also manually replicate the results of the `pipeline` if you'd like:

<frameworkcontent>
<pt>
Load an image processor to preprocess the image and return the `inputs` as PyTorch tensors:

```py
>>> from transformers import AutoImageProcessor
>>> import torch

>>> image_processor = AutoImageProcessor.from_pretrained("my_awesome_food_model")
>>> inputs = image_processor(image, return_tensors="pt")
```

Pass your inputs to the model and return the logits:

```py
>>> from transformers import AutoModelForImageClassification

>>> model = AutoModelForImageClassification.from_pretrained("my_awesome_food_model")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
```

Get the predicted label with the highest probability, and use the model's `id2label` mapping to convert it to a label:

```py
>>> predicted_label = logits.argmax(-1).item()
>>> model.config.id2label[predicted_label]
'beignets'
```
</pt>
</frameworkcontent>

<frameworkcontent>
<tf>
Load an image processor to preprocess the image and return the `inputs` as TensorFlow tensors:

```py
>>> from transformers import AutoImageProcessor

>>> image_processor = AutoImageProcessor.from_pretrained("MariaK/food_classifier")
>>> inputs = image_processor(image, return_tensors="tf")
```

Pass your inputs to the model and return the logits:

```py
>>> from transformers import TFAutoModelForImageClassification

>>> model = TFAutoModelForImageClassification.from_pretrained("MariaK/food_classifier")
>>> logits = model(**inputs).logits
```

Get the predicted label with the highest probability, and use the model's `id2label` mapping to convert it to a label:

```py
>>> predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
>>> model.config.id2label[predicted_class_id]
'beignets'
```

</tf>
</frameworkcontent>