<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Image classification

[[open-in-colab]]

<Youtube id="tjAIM7BOYhw"/>

Image classification assigns a label or class to an image. Unlike text or audio classification, the inputs are the pixel values that comprise an image. There are many applications for image classification such as detecting damage after a natural disaster, monitoring crop health, or helping screen medical images for signs of disease.

This guide will show you how to:
1. Finetune [ViT](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/vit) on the [Food-101](https://huggingface.co/datasets/food101) dataset to classify a food item in an image.
2. Use your finetuned model for inference.

<Tip>

See the image classification [task page](https://huggingface.co/tasks/image-classification) for more information about its associated models, datasets, and metrics.

</Tip>

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install transformers datasets evaluate
```

We encourage you to login to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to login:

```py
>>> from huggingface_hub import notebook_login

>>> notebook_login()
```

## Load Food-101 dataset

Start by loading a smaller subset of the Food-101 dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset.

```py
>>> from datasets import load_dataset

>>> food = load_dataset("food101", split="train[:5000]")
```

Split the dataset's `train` split into a train and test set with the [`~datasets.Dataset.train_test_split`] method:

```py
>>> food = food.train_test_split(test_size=0.2)
```
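If you're wondering how `test_size=0.2` divides things up: it reserves 20% of the loaded examples for the test split. A quick pure-Python sanity check of the arithmetic (illustrative only; 🤗 Datasets handles the shuffling and splitting for you):

```python
# Sketch of how an 80/20 split partitions the 5,000-example subset.
total = 5000
test_size = 0.2

n_test = int(total * test_size)  # examples held out for evaluation
n_train = total - n_test         # examples left for training

print(n_train, n_test)  # 4000 1000
```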

Then take a look at an example:

```py
>>> food["train"][0]
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F52AFC8AC50>,
 'label': 79}
```

There are two fields:

- `image`: a PIL image of the food item.
- `label`: the label class of the food item.

To make it easier for the model to get the label name from the label id, create a dictionary that maps the label name to an integer and vice versa:

```py
>>> labels = food["train"].features["label"].names
>>> label2id, id2label = dict(), dict()
>>> for i, label in enumerate(labels):
...     label2id[label] = str(i)
...     id2label[str(i)] = label
```

Now you can convert the label id to a label name:

```py
>>> id2label[str(79)]
'prime_rib'
```

## Preprocess

The next step is to load a ViT image processor to process the image into a tensor:

```py
>>> from transformers import AutoImageProcessor

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
```

Apply some image transformations to the images to make the model more robust against overfitting. Here you'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module, but you can also use any image library you like.

Crop a random part of the image, resize it, and normalize it with the image mean and standard deviation:

```py
>>> from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor

>>> normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
>>> size = (
...     image_processor.size["shortest_edge"]
...     if "shortest_edge" in image_processor.size
...     else (image_processor.size["height"], image_processor.size["width"])
... )
>>> _transforms = Compose([RandomResizedCrop(size), ToTensor(), normalize])
```

Then create a preprocessing function to apply the transforms and return the `pixel_values` - the inputs to the model - of the image:

```py
>>> def transforms(examples):
...     examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
...     del examples["image"]
...     return examples
```
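To see what `ToTensor` and `Normalize` do numerically, here is a hand-rolled sketch of the same math on a single pixel value, assuming this checkpoint's per-channel mean and std are both `0.5` (torchvision applies this to the whole image tensor at once):

```python
# ToTensor scales 8-bit pixel values from [0, 255] to [0.0, 1.0];
# Normalize then subtracts the channel mean and divides by the channel std.
# With mean=0.5 and std=0.5, pixel values end up in [-1, 1].
mean, std = 0.5, 0.5

def normalize_pixel(value):
    scaled = value / 255.0          # what ToTensor does
    return (scaled - mean) / std    # what Normalize does

print(normalize_pixel(0))    # -1.0
print(normalize_pixel(255))  # 1.0
```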

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [`~datasets.Dataset.with_transform`] method. The transforms are applied on the fly when you load an element of the dataset:

```py
>>> food = food.with_transform(transforms)
```

Now create a batch of examples using [`DefaultDataCollator`]. Unlike other data collators in 🤗 Transformers, the `DefaultDataCollator` does not apply additional preprocessing such as padding.

```py
>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator()
```
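Conceptually, the default collator just groups the per-example fields into batch-level fields, with no padding or other preprocessing. A rough pure-Python sketch of that idea (the real collator stacks framework tensors rather than lists):

```python
def collate(examples):
    # Turn a list of {field: value} dicts into one {field: [values]} batch,
    # mirroring the stacking DefaultDataCollator performs on tensors.
    batch = {}
    for key in examples[0]:
        batch[key] = [example[key] for example in examples]
    return batch

examples = [{"pixel_values": [0.1, 0.2], "label": 3},
            {"pixel_values": [0.3, 0.4], "label": 7}]
batch = collate(examples)
print(batch["label"])  # [3, 7]
```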

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

```py
>>> import evaluate

>>> accuracy = evaluate.load("accuracy")
```

Then create a function that passes your predictions and labels to [`~evaluate.EvaluationModule.compute`] to calculate the accuracy:

```py
>>> import numpy as np


>>> def compute_metrics(eval_pred):
...     predictions = np.argmax(eval_pred.predictions, axis=1)
...     return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)
```
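Under the hood, accuracy is just the fraction of argmax predictions that match the reference labels. A minimal pure-Python sketch of the same computation on toy logits (the `evaluate` metric does equivalent counting):

```python
logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]  # toy logits for 3 examples
labels = [1, 0, 0]

predictions = [row.index(max(row)) for row in logits]  # per-example argmax
accuracy = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
print(predictions)         # [1, 0, 1]
print(round(accuracy, 2))  # 0.67 - two of three correct
```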

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.

## Train
<frameworkcontent>
<pt>
<Tip>

If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! Load ViT with [`AutoModelForImageClassification`]. Specify the number of expected labels along with the label mappings:

```py
>>> from transformers import AutoModelForImageClassification, TrainingArguments, Trainer

>>> model = AutoModelForImageClassification.from_pretrained(
...     "google/vit-base-patch16-224-in21k",
...     num_labels=len(labels),
...     id2label=id2label,
...     label2id=label2id,
... )
```

At this point, only three steps remain:

1. Define your training hyperparameters in [`TrainingArguments`]. It is important you don't remove unused columns because this'll drop the `image` column. Without the `image` column, you can't create `pixel_values`. Set `remove_unused_columns=False` to prevent this behavior! The only other required parameter is `output_dir` which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [`Trainer`] will evaluate the accuracy and save the training checkpoint.
2. Pass the training arguments to [`Trainer`] along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [`~Trainer.train`] to finetune your model.

```py
>>> training_args = TrainingArguments(
...     output_dir="my_awesome_food_model",
...     remove_unused_columns=False,
...     evaluation_strategy="epoch",
...     save_strategy="epoch",
...     learning_rate=5e-5,
...     per_device_train_batch_size=16,
...     gradient_accumulation_steps=4,
...     per_device_eval_batch_size=16,
...     num_train_epochs=3,
...     warmup_ratio=0.1,
...     logging_steps=10,
...     load_best_model_at_end=True,
...     metric_for_best_model="accuracy",
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     data_collator=data_collator,
...     train_dataset=food["train"],
...     eval_dataset=food["test"],
...     tokenizer=image_processor,
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()
```

Once training is completed, share your model to the Hub with the [`~transformers.Trainer.push_to_hub`] method so everyone can use your model:

```py
>>> trainer.push_to_hub()
```
</pt>
</frameworkcontent>

<Tip>

For a more in-depth example of how to finetune a model for image classification, take a look at the corresponding [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
</Tip>

## Inference

Great, now that you've finetuned a model, you can use it for inference!

Load an image you'd like to run inference on:

```py
>>> ds = load_dataset("food101", split="validation[:10]")
>>> image = ds["image"][0]
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png" alt="image of beignets"/>
</div>

The simplest way to try out your finetuned model for inference is to use it in a [`pipeline`]. Instantiate a `pipeline` for image classification with your model, and pass your image to it:

```py
>>> from transformers import pipeline

>>> classifier = pipeline("image-classification", model="my_awesome_food_model")
>>> classifier(image)
[{'score': 0.35574808716773987, 'label': 'beignets'},
 {'score': 0.018057454377412796, 'label': 'chicken_wings'},
 {'score': 0.017733804881572723, 'label': 'prime_rib'},
 {'score': 0.016335085034370422, 'label': 'bruschetta'},
 {'score': 0.0160061065107584, 'label': 'ramen'}]
```

You can also manually replicate the results of the `pipeline` if you'd like:

<frameworkcontent>
<pt>
Load an image processor to preprocess the image and return the `inputs` as PyTorch tensors:

```py
>>> from transformers import AutoImageProcessor
>>> import torch

>>> image_processor = AutoImageProcessor.from_pretrained("my_awesome_food_model")
>>> inputs = image_processor(image, return_tensors="pt")
```

Pass your inputs to the model and return the logits:

```py
>>> from transformers import AutoModelForImageClassification

>>> model = AutoModelForImageClassification.from_pretrained("my_awesome_food_model")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
```

Get the predicted label with the highest probability, and use the model's `id2label` mapping to convert it to a label:

```py
>>> predicted_label = logits.argmax(-1).item()
>>> model.config.id2label[predicted_label]
'beignets'
```
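The per-label `score` values in the earlier `pipeline` output come from a softmax over these logits. As a minimal, self-contained sketch of that final step (toy logit values, not real model outputs):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, exponentiate, then normalize
    # so the scores form a probability distribution over the classes.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 0.5, 0.1])    # toy logits for three classes
print(probs.index(max(probs)))      # 0 - same argmax as on the raw logits
print(round(sum(probs), 6))         # 1.0 - probabilities sum to one
```

Note that softmax preserves the argmax, which is why the guide can take `logits.argmax(-1)` directly without computing probabilities first.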
</pt>
</frameworkcontent>