<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Image classification

[[open-in-colab]]

<Youtube id="tjAIM7BOYhw"/>

Image classification assigns a label or class to an image. Unlike text or audio classification, the inputs are the
pixel values that comprise an image. There are many applications for image classification, such as detecting damage
after a natural disaster, monitoring crop health, or helping screen medical images for signs of disease.

This guide illustrates how to:

1. Fine-tune [ViT](model_doc/vit) on the [Food-101](https://huggingface.co/datasets/food101) dataset to classify a food item in an image.
2. Use your fine-tuned model for inference.

<Tip>
The task illustrated in this tutorial is supported by the following model architectures:

<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

[BEiT](../model_doc/beit), [BiT](../model_doc/bit), [CLIP](../model_doc/clip), [ConvNeXT](../model_doc/convnext), [ConvNeXTV2](../model_doc/convnextv2), [CvT](../model_doc/cvt), [Data2VecVision](../model_doc/data2vec-vision), [DeiT](../model_doc/deit), [DiNAT](../model_doc/dinat), [DINOv2](../model_doc/dinov2), [EfficientFormer](../model_doc/efficientformer), [EfficientNet](../model_doc/efficientnet), [FocalNet](../model_doc/focalnet), [ImageGPT](../model_doc/imagegpt), [LeViT](../model_doc/levit), [MobileNetV1](../model_doc/mobilenet_v1), [MobileNetV2](../model_doc/mobilenet_v2), [MobileViT](../model_doc/mobilevit), [MobileViTV2](../model_doc/mobilevitv2), [NAT](../model_doc/nat), [Perceiver](../model_doc/perceiver), [PoolFormer](../model_doc/poolformer), [PVT](../model_doc/pvt), [PVTv2](../model_doc/pvt_v2), [RegNet](../model_doc/regnet), [ResNet](../model_doc/resnet), [SegFormer](../model_doc/segformer), [SigLIP](../model_doc/siglip), [SwiftFormer](../model_doc/swiftformer), [Swin Transformer](../model_doc/swin), [Swin Transformer V2](../model_doc/swinv2), [VAN](../model_doc/van), [ViT](../model_doc/vit), [ViT Hybrid](../model_doc/vit_hybrid), [ViTMSN](../model_doc/vit_msn)

<!--End of the generated tip-->

</Tip>

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install transformers datasets evaluate accelerate
```

We encourage you to log in to your Hugging Face account to upload and share your model with the community. When prompted, enter your token to log in:

```py
>>> from huggingface_hub import notebook_login

>>> notebook_login()
```
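
If you're working outside a notebook, you can log in from Python with `huggingface_hub.login` instead, which prompts for the same token:

```py
>>> from huggingface_hub import login

>>> login()
```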

## Load Food-101 dataset

Start by loading a smaller subset of the Food-101 dataset from the 🤗 Datasets library. This will give you a chance to
experiment and make sure everything works before spending more time training on the full dataset.

```py
>>> from datasets import load_dataset

>>> food = load_dataset("food101", split="train[:5000]")
```

Split the dataset's `train` split into a train and test set with the [`~datasets.Dataset.train_test_split`] method:

```py
>>> food = food.train_test_split(test_size=0.2)
```

Then take a look at an example:

```py
>>> food["train"][0]
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F52AFC8AC50>,
 'label': 79}
```

Each example in the dataset has two fields:

- `image`: a PIL image of the food item
- `label`: the label class of the food item
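
Since `image` is a PIL object, you can display an example directly in a notebook to eyeball the data (in a script, call `.show()` on it instead):

```py
>>> food["train"][0]["image"]
```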

To make it easier for the model to get the label name from the label id, create a dictionary that maps the label name
to an integer and vice versa:

```py
>>> labels = food["train"].features["label"].names
>>> label2id, id2label = dict(), dict()
>>> for i, label in enumerate(labels):
...     label2id[label] = str(i)
...     id2label[str(i)] = label
```

Now you can convert the label id to a label name:

```py
>>> id2label[str(79)]
'prime_rib'
```

## Preprocess

The next step is to load a ViT image processor to process the image into a tensor:

```py
>>> from transformers import AutoImageProcessor

>>> checkpoint = "google/vit-base-patch16-224-in21k"
>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)
```
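
The transforms in the next step reuse the normalization statistics and input size the checkpoint was pretrained with, which the processor exposes. For this checkpoint you should see something like:

```py
>>> image_processor.image_mean, image_processor.image_std
([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
>>> image_processor.size
{'height': 224, 'width': 224}
```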

<frameworkcontent>
<pt>
Apply some image transformations to the images to make the model more robust against overfitting. Here you'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module, but you can also use any image library you like.

Crop a random part of the image, resize it, and normalize it with the image mean and standard deviation:

```py
>>> from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor

>>> normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
>>> size = (
...     image_processor.size["shortest_edge"]
...     if "shortest_edge" in image_processor.size
...     else (image_processor.size["height"], image_processor.size["width"])
... )
>>> _transforms = Compose([RandomResizedCrop(size), ToTensor(), normalize])
```

Then create a preprocessing function to apply the transforms and return the `pixel_values` - the inputs to the model - of the image:

```py
>>> def transforms(examples):
...     examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
...     del examples["image"]
...     return examples
```

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [`~datasets.Dataset.with_transform`] method. The transforms are applied on the fly when you load an element of the dataset:

```py
>>> food = food.with_transform(transforms)
```
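
You can confirm the transform runs by indexing into an example, which should now contain a `pixel_values` tensor (the exact values will differ from run to run because the crop is random):

```py
>>> food["train"][0]["pixel_values"].shape
torch.Size([3, 224, 224])
```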

Now create a batch of examples using [`DefaultDataCollator`]. Unlike other data collators in 🤗 Transformers, the `DefaultDataCollator` does not apply additional preprocessing such as padding.

```py
>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator()
```
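
As an optional sanity check, collate a couple of examples by hand and verify the batch contains stacked `pixel_values` and a `labels` tensor (the collator renames the `label` column to `labels`):

```py
>>> batch = data_collator([food["train"][i] for i in range(2)])
>>> {k: v.shape for k, v in batch.items()}
{'labels': torch.Size([2]), 'pixel_values': torch.Size([2, 3, 224, 224])}
```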
</pt>
</frameworkcontent>


<frameworkcontent>
<tf>

To avoid overfitting and to make the model more robust, add some data augmentation to the training part of the dataset.
Here we use Keras preprocessing layers to define the transformations for the training data (including data augmentation)
and the transformations for the validation data (only center cropping, resizing, and normalizing). You can use `tf.image` or
any other library you prefer.

```py
>>> from tensorflow import keras
>>> from tensorflow.keras import layers

>>> size = (image_processor.size["height"], image_processor.size["width"])

>>> train_data_augmentation = keras.Sequential(
...     [
...         layers.RandomCrop(size[0], size[1]),
...         layers.Rescaling(scale=1.0 / 127.5, offset=-1),
...         layers.RandomFlip("horizontal"),
...         layers.RandomRotation(factor=0.02),
...         layers.RandomZoom(height_factor=0.2, width_factor=0.2),
...     ],
...     name="train_data_augmentation",
... )

>>> val_data_augmentation = keras.Sequential(
...     [
...         layers.CenterCrop(size[0], size[1]),
...         layers.Rescaling(scale=1.0 / 127.5, offset=-1),
...     ],
...     name="val_data_augmentation",
... )
```
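
As a quick sanity check, you can pass a dummy image batch through the training pipeline and confirm it is cropped to the size the model expects (only the shapes are deterministic; the augmentations are random):

```py
>>> import tensorflow as tf

>>> dummy_images = tf.random.uniform((1, 300, 300, 3), maxval=255)  # batch of one fake image
>>> train_data_augmentation(dummy_images).shape
TensorShape([1, 224, 224, 3])
```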

Next, create functions to apply appropriate transformations to a batch of images, instead of one image at a time.

```py
>>> import numpy as np
>>> import tensorflow as tf
>>> from PIL import Image


>>> def convert_to_tf_tensor(image: Image):
...     np_image = np.array(image)
...     tf_image = tf.convert_to_tensor(np_image)
...     # `expand_dims()` is used to add a batch dimension since
...     # the TF augmentation layers operates on batched inputs.
...     return tf.expand_dims(tf_image, 0)


>>> def preprocess_train(example_batch):
...     """Apply train_transforms across a batch."""
...     images = [
...         train_data_augmentation(convert_to_tf_tensor(image.convert("RGB"))) for image in example_batch["image"]
...     ]
...     example_batch["pixel_values"] = [tf.transpose(tf.squeeze(image)) for image in images]
...     return example_batch


>>> def preprocess_val(example_batch):
...     """Apply val_transforms across a batch."""
...     images = [
...         val_data_augmentation(convert_to_tf_tensor(image.convert("RGB"))) for image in example_batch["image"]
...     ]
...     example_batch["pixel_values"] = [tf.transpose(tf.squeeze(image)) for image in images]
...     return example_batch
```

Use the 🤗 Datasets [`~datasets.Dataset.set_transform`] method to apply the transformations on the fly:

```py
food["train"].set_transform(preprocess_train)
food["test"].set_transform(preprocess_val)
```

As a final preprocessing step, create a batch of examples using `DefaultDataCollator`. Unlike other data collators in 🤗 Transformers, the
`DefaultDataCollator` does not apply additional preprocessing, such as padding.

```py
>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator(return_tensors="tf")
```
</tf>
</frameworkcontent>

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an
evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load
the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

```py
>>> import evaluate

>>> accuracy = evaluate.load("accuracy")
```

Then create a function that passes your predictions and labels to [`~evaluate.EvaluationModule.compute`] to calculate the accuracy:

```py
>>> import numpy as np


>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     predictions = np.argmax(predictions, axis=1)
...     return accuracy.compute(predictions=predictions, references=labels)
```
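
A quick way to make sure the function behaves as expected is to call it on some dummy logits, for example a hypothetical two-class case:

```py
>>> dummy_logits = np.array([[0.1, 0.9], [0.8, 0.2]])  # argmax -> [1, 0]
>>> compute_metrics((dummy_logits, np.array([1, 0])))
{'accuracy': 1.0}
```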

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.

## Train

<frameworkcontent>
<pt>
<Tip>

If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! Load ViT with [`AutoModelForImageClassification`]. Specify the number of expected labels along with the label mappings:

```py
>>> from transformers import AutoModelForImageClassification, TrainingArguments, Trainer

>>> model = AutoModelForImageClassification.from_pretrained(
...     checkpoint,
...     num_labels=len(labels),
...     id2label=id2label,
...     label2id=label2id,
... )
```

At this point, only three steps remain:

1. Define your training hyperparameters in [`TrainingArguments`]. It is important you don't remove unused columns because that'll drop the `image` column. Without the `image` column, you can't create `pixel_values`. Set `remove_unused_columns=False` to prevent this behavior! The only other required parameter is `output_dir` which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [`Trainer`] will evaluate the accuracy and save the training checkpoint.
2. Pass the training arguments to [`Trainer`] along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [`~Trainer.train`] to finetune your model.

```py
>>> training_args = TrainingArguments(
...     output_dir="my_awesome_food_model",
...     remove_unused_columns=False,
...     evaluation_strategy="epoch",
...     save_strategy="epoch",
...     learning_rate=5e-5,
...     per_device_train_batch_size=16,
...     gradient_accumulation_steps=4,
...     per_device_eval_batch_size=16,
...     num_train_epochs=3,
...     warmup_ratio=0.1,
...     logging_steps=10,
...     load_best_model_at_end=True,
...     metric_for_best_model="accuracy",
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     data_collator=data_collator,
...     train_dataset=food["train"],
...     eval_dataset=food["test"],
...     tokenizer=image_processor,
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()
```

Once training is completed, share your model on the Hub with the [`~transformers.Trainer.push_to_hub`] method so everyone can use your model:

```py
>>> trainer.push_to_hub()
```
</pt>
</frameworkcontent>

<frameworkcontent>
<tf>

<Tip>

If you are unfamiliar with fine-tuning a model with Keras, check out the [basic tutorial](./training#train-a-tensorflow-model-with-keras) first!

</Tip>

To fine-tune a model in TensorFlow, follow these steps:
1. Define the training hyperparameters, and set up an optimizer and a learning rate schedule.
2. Instantiate a pre-trained model.
3. Convert a 🤗 Dataset to a `tf.data.Dataset`.
4. Compile your model.
5. Add callbacks and use the `fit()` method to run the training.
6. Upload your model to the 🤗 Hub to share with the community.

Start by defining the hyperparameters, optimizer and learning rate schedule:

```py
>>> from transformers import create_optimizer

>>> batch_size = 16
>>> num_epochs = 5
>>> num_train_steps = len(food["train"]) * num_epochs
>>> learning_rate = 3e-5
>>> weight_decay_rate = 0.01

>>> optimizer, lr_schedule = create_optimizer(
...     init_lr=learning_rate,
...     num_train_steps=num_train_steps,
...     weight_decay_rate=weight_decay_rate,
...     num_warmup_steps=0,
... )
```

Then, load ViT with [`TFAutoModelForImageClassification`] along with the label mappings:

```py
>>> from transformers import TFAutoModelForImageClassification

>>> model = TFAutoModelForImageClassification.from_pretrained(
...     checkpoint,
...     id2label=id2label,
...     label2id=label2id,
... )
```

Convert your datasets to the `tf.data.Dataset` format using the [`~datasets.Dataset.to_tf_dataset`] method and your `data_collator`:

```py
>>> # converting our train dataset to tf.data.Dataset
>>> tf_train_dataset = food["train"].to_tf_dataset(
...     columns="pixel_values", label_cols="label", shuffle=True, batch_size=batch_size, collate_fn=data_collator
... )

>>> # converting our test dataset to tf.data.Dataset
>>> tf_eval_dataset = food["test"].to_tf_dataset(
...     columns="pixel_values", label_cols="label", shuffle=True, batch_size=batch_size, collate_fn=data_collator
... )
```

Configure the model for training with `compile()`:

```py
>>> from tensorflow.keras.losses import SparseCategoricalCrossentropy

>>> loss = SparseCategoricalCrossentropy(from_logits=True)
>>> model.compile(optimizer=optimizer, loss=loss)
```

To compute the accuracy from the predictions and push your model to the 🤗 Hub, use [Keras callbacks](../main_classes/keras_callbacks).
Pass your `compute_metrics` function to [KerasMetricCallback](../main_classes/keras_callbacks#transformers.KerasMetricCallback),
and use the [PushToHubCallback](../main_classes/keras_callbacks#transformers.PushToHubCallback) to upload the model:

```py
>>> from transformers.keras_callbacks import KerasMetricCallback, PushToHubCallback

>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_eval_dataset)
>>> push_to_hub_callback = PushToHubCallback(
...     output_dir="food_classifier",
...     tokenizer=image_processor,
...     save_strategy="no",
... )
>>> callbacks = [metric_callback, push_to_hub_callback]
```

Finally, you are ready to train your model! Call `fit()` with your training and validation datasets, the number of epochs,
and your callbacks to fine-tune the model:

```py
>>> model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=num_epochs, callbacks=callbacks)
Epoch 1/5
250/250 [==============================] - 313s 1s/step - loss: 2.5623 - val_loss: 1.4161 - accuracy: 0.9290
Epoch 2/5
250/250 [==============================] - 265s 1s/step - loss: 0.9181 - val_loss: 0.6808 - accuracy: 0.9690
Epoch 3/5
250/250 [==============================] - 252s 1s/step - loss: 0.3910 - val_loss: 0.4303 - accuracy: 0.9820
Epoch 4/5
250/250 [==============================] - 251s 1s/step - loss: 0.2028 - val_loss: 0.3191 - accuracy: 0.9900
Epoch 5/5
250/250 [==============================] - 238s 949ms/step - loss: 0.1232 - val_loss: 0.3259 - accuracy: 0.9890
```

Congratulations! You have fine-tuned your model and shared it on the 🤗 Hub. You can now use it for inference!
</tf>
</frameworkcontent>


<Tip>

For a more in-depth example of how to finetune a model for image classification, take a look at the corresponding [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).

</Tip>

## Inference

Great, now that you've fine-tuned a model, you can use it for inference!

Load an image you'd like to run inference on:

```py
>>> ds = load_dataset("food101", split="validation[:10]")
>>> image = ds["image"][0]
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png" alt="image of beignets"/>
</div>

The simplest way to try out your finetuned model for inference is to use it in a [`pipeline`]. Instantiate a `pipeline` for image classification with your model, and pass your image to it:

```py
>>> from transformers import pipeline

>>> classifier = pipeline("image-classification", model="my_awesome_food_model")
>>> classifier(image)
[{'score': 0.31856709718704224, 'label': 'beignets'},
 {'score': 0.015232225880026817, 'label': 'bruschetta'},
 {'score': 0.01519392803311348, 'label': 'chicken_wings'},
 {'score': 0.013022331520915031, 'label': 'pork_chop'},
 {'score': 0.012728818692266941, 'label': 'prime_rib'}]
```

You can also manually replicate the results of the `pipeline` if you'd like:

<frameworkcontent>
<pt>
Load an image processor to preprocess the image and return the `inputs` as PyTorch tensors:

```py
>>> from transformers import AutoImageProcessor
>>> import torch

>>> image_processor = AutoImageProcessor.from_pretrained("my_awesome_food_model")
>>> inputs = image_processor(image, return_tensors="pt")
```

Pass your inputs to the model and return the logits:

```py
>>> from transformers import AutoModelForImageClassification

>>> model = AutoModelForImageClassification.from_pretrained("my_awesome_food_model")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
```

Get the predicted label with the highest probability, and use the model's `id2label` mapping to convert it to a label:

```py
>>> predicted_label = logits.argmax(-1).item()
>>> model.config.id2label[predicted_label]
'beignets'
```
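
The pipeline above reported scores rather than raw logits; to reproduce those as well, apply a softmax over the logits (a small sketch):

```py
>>> probabilities = torch.nn.functional.softmax(logits, dim=-1)
>>> probabilities[0, predicted_label].item()  # should line up with the pipeline's top score
```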
</pt>
</frameworkcontent>

<frameworkcontent>
<tf>
Load an image processor to preprocess the image and return the `inputs` as TensorFlow tensors:

```py
>>> from transformers import AutoImageProcessor

>>> image_processor = AutoImageProcessor.from_pretrained("MariaK/food_classifier")
>>> inputs = image_processor(image, return_tensors="tf")
```

Pass your inputs to the model and return the logits:

```py
>>> from transformers import TFAutoModelForImageClassification

>>> model = TFAutoModelForImageClassification.from_pretrained("MariaK/food_classifier")
>>> logits = model(**inputs).logits
```

Get the predicted label with the highest probability, and use the model's `id2label` mapping to convert it to a label:

```py
>>> predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
>>> model.config.id2label[predicted_class_id]
'beignets'
```

</tf>
</frameworkcontent>