<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Image classification

[[open-in-colab]]

<Youtube id="tjAIM7BOYhw"/>

Image classification assigns a label or class to an image. Unlike text or audio classification, the inputs are the
pixel values that comprise an image. There are many applications for image classification, such as detecting damage
after a natural disaster, monitoring crop health, or helping screen medical images for signs of disease.

This guide illustrates how to:

1. Fine-tune [ViT](model_doc/vit) on the [Food-101](https://huggingface.co/datasets/food101) dataset to classify a food item in an image.
2. Use your fine-tuned model for inference.

<Tip>

The task illustrated in this tutorial is supported by the following model architectures:

<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

[BEiT](../model_doc/beit), [BiT](../model_doc/bit), [ConvNeXT](../model_doc/convnext), [ConvNeXTV2](../model_doc/convnextv2), [CvT](../model_doc/cvt), [Data2VecVision](../model_doc/data2vec-vision), [DeiT](../model_doc/deit), [DiNAT](../model_doc/dinat), [DINOv2](../model_doc/dinov2), [EfficientFormer](../model_doc/efficientformer), [EfficientNet](../model_doc/efficientnet), [FocalNet](../model_doc/focalnet), [ImageGPT](../model_doc/imagegpt), [LeViT](../model_doc/levit), [MobileNetV1](../model_doc/mobilenet_v1), [MobileNetV2](../model_doc/mobilenet_v2), [MobileViT](../model_doc/mobilevit), [MobileViTV2](../model_doc/mobilevitv2), [NAT](../model_doc/nat), [Perceiver](../model_doc/perceiver), [PoolFormer](../model_doc/poolformer), [RegNet](../model_doc/regnet), [ResNet](../model_doc/resnet), [SegFormer](../model_doc/segformer), [SwiftFormer](../model_doc/swiftformer), [Swin Transformer](../model_doc/swin), [Swin Transformer V2](../model_doc/swinv2), [VAN](../model_doc/van), [ViT](../model_doc/vit), [ViT Hybrid](../model_doc/vit_hybrid), [ViTMSN](../model_doc/vit_msn)

<!--End of the generated tip-->

</Tip>

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install transformers datasets evaluate
```

We encourage you to log in to your Hugging Face account to upload and share your model with the community. When prompted, enter your token to log in:

```py
>>> from huggingface_hub import notebook_login

>>> notebook_login()
```

## Load Food-101 dataset

Start by loading a smaller subset of the Food-101 dataset from the 🤗 Datasets library. This will give you a chance to
experiment and make sure everything works before spending more time training on the full dataset.

```py
>>> from datasets import load_dataset

>>> food = load_dataset("food101", split="train[:5000]")
```

Split the dataset's `train` split into a train and test set with the [`~datasets.Dataset.train_test_split`] method:

```py
>>> food = food.train_test_split(test_size=0.2)
```

Then take a look at an example:

```py
>>> food["train"][0]
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F52AFC8AC50>,
 'label': 79}
```

Each example in the dataset has two fields:

- `image`: a PIL image of the food item
- `label`: the label class of the food item
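
If you want to double-check the class names, inspect the dataset's `label` feature. It is a [`~datasets.ClassLabel`] holding the 101 Food-101 categories in alphabetical order (output abridged here):

```py
>>> food["train"].features["label"]
ClassLabel(names=['apple_pie', 'baby_back_ribs', ..., 'waffles'], id=None)
```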

To make it easier for the model to get the label name from the label id, create a dictionary that maps the label name
to an integer and vice versa:

```py
>>> labels = food["train"].features["label"].names
>>> label2id, id2label = dict(), dict()
>>> for i, label in enumerate(labels):
...     label2id[label] = str(i)
...     id2label[str(i)] = label
```

Now you can convert the label id to a label name:

```py
>>> id2label[str(79)]
'prime_rib'
```

## Preprocess

The next step is to load a ViT image processor to process the image into a tensor:

```py
>>> from transformers import AutoImageProcessor

>>> checkpoint = "google/vit-base-patch16-224-in21k"
>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)
```
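
The transforms in the next step read the crop size and normalization statistics from the processor's attributes, so it is worth peeking at them (values shown are for this checkpoint and may differ for others):

```py
>>> image_processor.size
{'height': 224, 'width': 224}
>>> image_processor.image_mean, image_processor.image_std
([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
```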

<frameworkcontent>
<pt>
Apply some image transformations to the images to make the model more robust against overfitting. Here you'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module, but you can also use any image library you like.

Crop a random part of the image, resize it, and normalize it with the image mean and standard deviation:

```py
>>> from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor

>>> normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
>>> size = (
...     image_processor.size["shortest_edge"]
...     if "shortest_edge" in image_processor.size
...     else (image_processor.size["height"], image_processor.size["width"])
... )
>>> _transforms = Compose([RandomResizedCrop(size), ToTensor(), normalize])
```

Then create a preprocessing function to apply the transforms and return the `pixel_values` - the inputs to the model - of the image:

```py
>>> def transforms(examples):
...     examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
...     del examples["image"]
...     return examples
```

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [`~datasets.Dataset.with_transform`] method. The transforms are applied on the fly when you load an element of the dataset:

```py
>>> food = food.with_transform(transforms)
```
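
Indexing the dataset now triggers the transform, and each example carries `pixel_values` in place of `image` (a quick sanity check; the shape reflects the 224x224 crop defined above):

```py
>>> food["train"][0]["pixel_values"].shape
torch.Size([3, 224, 224])
```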

Now create a batch of examples using [`DefaultDataCollator`]. Unlike other data collators in 🤗 Transformers, the `DefaultDataCollator` does not apply additional preprocessing such as padding.

```py
>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator()
```
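
As a quick illustration of what the collator does (assuming the transformed dataset from above), it stacks the per-example tensors into batch tensors and renames `label` to `labels`:

```py
>>> batch = data_collator([food["train"][i] for i in range(2)])
>>> batch["pixel_values"].shape
torch.Size([2, 3, 224, 224])
>>> batch["labels"].shape
torch.Size([2])
```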
</pt>
</frameworkcontent>


<frameworkcontent>
<tf>

To avoid overfitting and to make the model more robust, add some data augmentation to the training part of the dataset.
Here we use Keras preprocessing layers to define the transformations for the training data (which include data augmentation),
and the transformations for the validation data (only center cropping, resizing, and normalizing). You can use `tf.image` or
any other library you prefer.

```py
>>> from tensorflow import keras
>>> from tensorflow.keras import layers

>>> size = (image_processor.size["height"], image_processor.size["width"])

>>> train_data_augmentation = keras.Sequential(
...     [
...         layers.RandomCrop(size[0], size[1]),
...         layers.Rescaling(scale=1.0 / 127.5, offset=-1),
...         layers.RandomFlip("horizontal"),
...         layers.RandomRotation(factor=0.02),
...         layers.RandomZoom(height_factor=0.2, width_factor=0.2),
...     ],
...     name="train_data_augmentation",
... )

>>> val_data_augmentation = keras.Sequential(
...     [
...         layers.CenterCrop(size[0], size[1]),
...         layers.Rescaling(scale=1.0 / 127.5, offset=-1),
...     ],
...     name="val_data_augmentation",
... )
```

Next, create functions to apply appropriate transformations to a batch of images, instead of one image at a time.

```py
>>> import numpy as np
>>> import tensorflow as tf
>>> from PIL import Image


>>> def convert_to_tf_tensor(image: Image):
...     np_image = np.array(image)
...     tf_image = tf.convert_to_tensor(np_image)
...     # `expand_dims()` is used to add a batch dimension since
...     # the TF augmentation layers operates on batched inputs.
...     return tf.expand_dims(tf_image, 0)


>>> def preprocess_train(example_batch):
...     """Apply train_transforms across a batch."""
...     images = [
...         train_data_augmentation(convert_to_tf_tensor(image.convert("RGB"))) for image in example_batch["image"]
...     ]
...     example_batch["pixel_values"] = [tf.transpose(tf.squeeze(image)) for image in images]
...     return example_batch


>>> def preprocess_val(example_batch):
...     """Apply val_transforms across a batch."""
...     images = [
...         val_data_augmentation(convert_to_tf_tensor(image.convert("RGB"))) for image in example_batch["image"]
...     ]
...     example_batch["pixel_values"] = [tf.transpose(tf.squeeze(image)) for image in images]
...     return example_batch
```

Use the 🤗 Datasets [`~datasets.Dataset.set_transform`] method to apply the transformations on the fly:

```py
food["train"].set_transform(preprocess_train)
food["test"].set_transform(preprocess_val)
```
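
As on the PyTorch side, the transforms run on the fly when you access an example, which now carries a channels-first `pixel_values` tensor (a quick check; the shape reflects the 224x224 crop above):

```py
>>> food["train"][0]["pixel_values"].shape
TensorShape([3, 224, 224])
```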

As a final preprocessing step, create a batch of examples using `DefaultDataCollator`. Unlike other data collators in 🤗 Transformers, the
`DefaultDataCollator` does not apply additional preprocessing, such as padding.

```py
>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator(return_tensors="tf")
```
</tf>
</frameworkcontent>

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an
evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load
the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

```py
>>> import evaluate

>>> accuracy = evaluate.load("accuracy")
```

Then create a function that passes your predictions and labels to [`~evaluate.EvaluationModule.compute`] to calculate the accuracy:

```py
>>> import numpy as np


>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     predictions = np.argmax(predictions, axis=1)
...     return accuracy.compute(predictions=predictions, references=labels)
```
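
You can sanity-check the function on a dummy batch of logits and labels before training; in this made-up example the first prediction is correct and the second is not:

```py
>>> compute_metrics((np.array([[0.1, 0.9], [0.8, 0.2]]), np.array([1, 1])))
{'accuracy': 0.5}
```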

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.

## Train

<frameworkcontent>
<pt>
<Tip>

If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! Load ViT with [`AutoModelForImageClassification`]. Specify the number of expected labels along with the label mappings:

```py
>>> from transformers import AutoModelForImageClassification, TrainingArguments, Trainer

>>> model = AutoModelForImageClassification.from_pretrained(
...     checkpoint,
...     num_labels=len(labels),
...     id2label=id2label,
...     label2id=label2id,
... )
```

At this point, only three steps remain:

1. Define your training hyperparameters in [`TrainingArguments`]. It is important you don't remove unused columns because that'll drop the `image` column. Without the `image` column, you can't create `pixel_values`. Set `remove_unused_columns=False` to prevent this behavior! The only other required parameter is `output_dir` which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [`Trainer`] will evaluate the accuracy and save the training checkpoint.
2. Pass the training arguments to [`Trainer`] along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [`~Trainer.train`] to finetune your model.

```py
>>> training_args = TrainingArguments(
...     output_dir="my_awesome_food_model",
...     remove_unused_columns=False,
...     evaluation_strategy="epoch",
...     save_strategy="epoch",
...     learning_rate=5e-5,
...     per_device_train_batch_size=16,
...     gradient_accumulation_steps=4,
...     per_device_eval_batch_size=16,
...     num_train_epochs=3,
...     warmup_ratio=0.1,
...     logging_steps=10,
...     load_best_model_at_end=True,
...     metric_for_best_model="accuracy",
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     data_collator=data_collator,
...     train_dataset=food["train"],
...     eval_dataset=food["test"],
...     tokenizer=image_processor,
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()
```

Once training is completed, share your model on the Hub with the [`~transformers.Trainer.push_to_hub`] method so everyone can use your model:

```py
>>> trainer.push_to_hub()
```
</pt>
</frameworkcontent>

<frameworkcontent>
<tf>

<Tip>

If you are unfamiliar with fine-tuning a model with Keras, check out the [basic tutorial](./training#train-a-tensorflow-model-with-keras) first!

</Tip>

To fine-tune a model in TensorFlow, follow these steps:
1. Define the training hyperparameters, and set up an optimizer and a learning rate schedule.
2. Instantiate a pre-trained model.
3. Convert a 🤗 Dataset to a `tf.data.Dataset`.
4. Compile your model.
5. Add callbacks and use the `fit()` method to run the training.
6. Upload your model to the 🤗 Hub to share with the community.

Start by defining the hyperparameters, optimizer and learning rate schedule:

```py
>>> from transformers import create_optimizer

>>> batch_size = 16
>>> num_epochs = 5
>>> num_train_steps = len(food["train"]) * num_epochs
>>> learning_rate = 3e-5
>>> weight_decay_rate = 0.01

>>> optimizer, lr_schedule = create_optimizer(
...     init_lr=learning_rate,
...     num_train_steps=num_train_steps,
...     weight_decay_rate=weight_decay_rate,
...     num_warmup_steps=0,
... )
```
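
With the 5,000-example subset split 80/20 earlier, `len(food["train"])` is 4,000, so this schedule is set up for 4,000 * 5 = 20,000 training steps.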

Then, load ViT with [`TFAutoModelForImageClassification`] along with the label mappings:

```py
>>> from transformers import TFAutoModelForImageClassification

>>> model = TFAutoModelForImageClassification.from_pretrained(
...     checkpoint,
...     id2label=id2label,
...     label2id=label2id,
... )
```

Convert your datasets to the `tf.data.Dataset` format using the [`~datasets.Dataset.to_tf_dataset`] method and your `data_collator`:

```py
>>> # converting our train dataset to tf.data.Dataset
>>> tf_train_dataset = food["train"].to_tf_dataset(
...     columns="pixel_values", label_cols="label", shuffle=True, batch_size=batch_size, collate_fn=data_collator
... )

>>> # converting our test dataset to tf.data.Dataset
>>> tf_eval_dataset = food["test"].to_tf_dataset(
...     columns="pixel_values", label_cols="label", shuffle=True, batch_size=batch_size, collate_fn=data_collator
... )
```

Configure the model for training with `compile()`:

```py
>>> from tensorflow.keras.losses import SparseCategoricalCrossentropy

>>> loss = SparseCategoricalCrossentropy(from_logits=True)
>>> model.compile(optimizer=optimizer, loss=loss)
```

To compute the accuracy from the predictions and push your model to the 🤗 Hub, use [Keras callbacks](../main_classes/keras_callbacks).
Pass your `compute_metrics` function to [KerasMetricCallback](../main_classes/keras_callbacks#transformers.KerasMetricCallback),
and use the [PushToHubCallback](../main_classes/keras_callbacks#transformers.PushToHubCallback) to upload the model:

```py
>>> from transformers.keras_callbacks import KerasMetricCallback, PushToHubCallback

>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_eval_dataset)
>>> push_to_hub_callback = PushToHubCallback(
...     output_dir="food_classifier",
...     tokenizer=image_processor,
...     save_strategy="no",
... )
>>> callbacks = [metric_callback, push_to_hub_callback]
```

Finally, you are ready to train your model! Call `fit()` with your training and validation datasets, the number of epochs,
and your callbacks to fine-tune the model:

```py
>>> model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=num_epochs, callbacks=callbacks)
Epoch 1/5
250/250 [==============================] - 313s 1s/step - loss: 2.5623 - val_loss: 1.4161 - accuracy: 0.9290
Epoch 2/5
250/250 [==============================] - 265s 1s/step - loss: 0.9181 - val_loss: 0.6808 - accuracy: 0.9690
Epoch 3/5
250/250 [==============================] - 252s 1s/step - loss: 0.3910 - val_loss: 0.4303 - accuracy: 0.9820
Epoch 4/5
250/250 [==============================] - 251s 1s/step - loss: 0.2028 - val_loss: 0.3191 - accuracy: 0.9900
Epoch 5/5
250/250 [==============================] - 238s 949ms/step - loss: 0.1232 - val_loss: 0.3259 - accuracy: 0.9890
```

Congratulations! You have fine-tuned your model and shared it on the 🤗 Hub. You can now use it for inference!
</tf>
</frameworkcontent>


<Tip>

For a more in-depth example of how to finetune a model for image classification, take a look at the corresponding [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).

</Tip>

## Inference

Great, now that you've fine-tuned a model, you can use it for inference!

Load an image you'd like to run inference on:

```py
>>> ds = load_dataset("food101", split="validation[:10]")
>>> image = ds["image"][0]
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png" alt="image of beignets"/>
</div>

The simplest way to try out your finetuned model for inference is to use it in a [`pipeline`]. Instantiate a `pipeline` for image classification with your model, and pass your image to it:

```py
>>> from transformers import pipeline

>>> classifier = pipeline("image-classification", model="my_awesome_food_model")
>>> classifier(image)
[{'score': 0.31856709718704224, 'label': 'beignets'},
 {'score': 0.015232225880026817, 'label': 'bruschetta'},
 {'score': 0.01519392803311348, 'label': 'chicken_wings'},
 {'score': 0.013022331520915031, 'label': 'pork_chop'},
 {'score': 0.012728818692266941, 'label': 'prime_rib'}]
```

You can also manually replicate the results of the `pipeline` if you'd like:

<frameworkcontent>
<pt>
Load an image processor to preprocess the image and return the `inputs` as PyTorch tensors:

```py
>>> from transformers import AutoImageProcessor
>>> import torch

>>> image_processor = AutoImageProcessor.from_pretrained("my_awesome_food_model")
>>> inputs = image_processor(image, return_tensors="pt")
```

Pass your inputs to the model and return the logits:

```py
>>> from transformers import AutoModelForImageClassification

>>> model = AutoModelForImageClassification.from_pretrained("my_awesome_food_model")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
```

Get the predicted label with the highest probability, and use the model's `id2label` mapping to convert it to a label:

```py
>>> predicted_label = logits.argmax(-1).item()
>>> model.config.id2label[predicted_label]
'beignets'
```
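
If you also want a confidence score like the one the `pipeline` returned, apply a softmax over the logits (an optional extra step; with the fine-tuned model above, this should match the pipeline's top score):

```py
>>> probabilities = torch.softmax(logits, dim=-1)
>>> probabilities[0, predicted_label].item()  # score of the predicted class
```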
</pt>
</frameworkcontent>

<frameworkcontent>
<tf>
Load an image processor to preprocess the image and return the `inputs` as TensorFlow tensors:

```py
>>> from transformers import AutoImageProcessor

>>> image_processor = AutoImageProcessor.from_pretrained("MariaK/food_classifier")
>>> inputs = image_processor(image, return_tensors="tf")
```

Pass your inputs to the model and return the logits:

```py
>>> from transformers import TFAutoModelForImageClassification

>>> model = TFAutoModelForImageClassification.from_pretrained("MariaK/food_classifier")
>>> logits = model(**inputs).logits
```

Get the predicted label with the highest probability, and use the model's `id2label` mapping to convert it to a label:

```py
>>> predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
>>> model.config.id2label[predicted_class_id]
'beignets'
```
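
As in the PyTorch example, a softmax over the logits recovers the score of the predicted class (an optional extra step):

```py
>>> probabilities = tf.nn.softmax(logits, axis=-1)
>>> float(probabilities[0, predicted_class_id])  # score of the predicted class
```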

</tf>
</frameworkcontent>