<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# SegFormer

## Overview

The SegFormer model was proposed in [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping
Luo. The model consists of a hierarchical Transformer encoder and a lightweight all-MLP decode head to achieve great
results on image segmentation benchmarks such as ADE20K and Cityscapes.

The abstract from the paper is the following:

*We present SegFormer, a simple, efficient yet powerful semantic segmentation framework which unifies Transformers with
lightweight multilayer perception (MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a novel
hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding,
thereby avoiding the interpolation of positional codes which leads to decreased performance when the testing resolution
differs from training. 2) SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from
different layers, and thus combining both local attention and global attention to render powerful representations. We
show that this simple and lightweight design is the key to efficient segmentation on Transformers. We scale our
approach up to obtain a series of models from SegFormer-B0 to SegFormer-B5, reaching significantly better performance
and efficiency than previous counterparts. For example, SegFormer-B4 achieves 50.3% mIoU on ADE20K with 64M parameters,
being 5x smaller and 2.2% better than the previous best method. Our best model, SegFormer-B5, achieves 84.0% mIoU on
Cityscapes validation set and shows excellent zero-shot robustness on Cityscapes-C.*

The figure below illustrates the architecture of SegFormer. Taken from the [original paper](https://arxiv.org/abs/2105.15203).

<img width="600" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/segformer_architecture.png"/>

This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version
of the model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code can be found [here](https://github.com/NVlabs/SegFormer).

## Usage tips

- SegFormer consists of a hierarchical Transformer encoder, and a lightweight all-MLP decoder head.
  [`SegformerModel`] is the hierarchical Transformer encoder (which in the paper is also referred to
  as Mix Transformer or MiT). [`SegformerForSemanticSegmentation`] adds the all-MLP decoder head on
  top to perform semantic segmentation of images. In addition, there's
  [`SegformerForImageClassification`] which can be used to - you guessed it - classify images. The
  authors of SegFormer first pre-trained the Transformer encoder on ImageNet-1k to classify images. Next, they threw
  away the classification head and replaced it with the all-MLP decode head. Finally, they fine-tuned the whole model on
  ADE20K, Cityscapes and COCO-Stuff, which are important benchmarks for semantic segmentation. All checkpoints can be
  found on the [hub](https://huggingface.co/models?other=segformer).
- The quickest way to get started with SegFormer is by checking the [example notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/SegFormer) (which showcase both inference and
  fine-tuning on custom data). One can also check out the [blog post](https://huggingface.co/blog/fine-tune-segformer) introducing SegFormer and illustrating how it can be fine-tuned on custom data.
- TensorFlow users should refer to [this repository](https://github.com/deep-diver/segformer-tf-transformers) that shows off-the-shelf inference and fine-tuning.
- One can also check out [this interactive demo on Hugging Face Spaces](https://huggingface.co/spaces/chansung/segformer-tf-transformers)
  to try out a SegFormer model on custom images.
- SegFormer works on any input size, as it pads the input to be divisible by `config.patch_sizes`.
- One can use [`SegformerImageProcessor`] to prepare images and corresponding segmentation maps
  for the model (as shown in the inference sketch below the table). Note that this image processor is fairly basic and
  does not include all data augmentations used in the original paper. The original preprocessing pipelines (for the
  ADE20k dataset, for instance) can be found [here](https://github.com/NVlabs/SegFormer/blob/master/local_configs/_base_/datasets/ade20k_repeat.py).
  The most important preprocessing step is that images and segmentation maps are randomly cropped and padded to the
  same size, such as 512x512 or 640x640, after which they are normalized.
- One additional thing to keep in mind is that one can initialize [`SegformerImageProcessor`] with
  `reduce_labels` set to `True` or `False`. In some datasets (like ADE20k), the 0 index is used in the annotated
  segmentation maps for background. However, ADE20k doesn't include the "background" class in its 150 labels.
  Therefore, `reduce_labels` is used to reduce all labels by 1, and to make sure no loss is computed for the
  background class (i.e. it replaces 0 in the annotated maps by 255, which is the *ignore_index* of the loss function
  used by [`SegformerForSemanticSegmentation`]). However, other datasets use the 0 index as
  background class and include this class as part of all labels. In that case, `reduce_labels` should be set to
  `False`, as loss should also be computed for the background class.
- Like most models, SegFormer comes in different sizes, the details of which can be found in the table below
  (taken from Table 7 of the [original paper](https://arxiv.org/abs/2105.15203)).

| **Model variant** | **Depths**    | **Hidden sizes**    | **Decoder hidden size** | **Params (M)** | **ImageNet-1k Top 1** |
| :---------------: | ------------- | ------------------- | :---------------------: | :------------: | :-------------------: |
| MiT-b0            | [2, 2, 2, 2]  | [32, 64, 160, 256]  | 256                     | 3.7            | 70.5                  |
| MiT-b1            | [2, 2, 2, 2]  | [64, 128, 320, 512] | 256                     | 14.0           | 78.7                  |
| MiT-b2            | [3, 4, 6, 3]  | [64, 128, 320, 512] | 768                     | 25.4           | 81.6                  |
| MiT-b3            | [3, 4, 18, 3] | [64, 128, 320, 512] | 768                     | 45.2           | 83.1                  |
| MiT-b4            | [3, 8, 27, 3] | [64, 128, 320, 512] | 768                     | 62.6           | 83.6                  |
| MiT-b5            | [3, 6, 40, 3] | [64, 128, 320, 512] | 768                     | 82.0           | 83.8                  |

Note that MiT in the above table refers to the Mix Transformer encoder backbone introduced in SegFormer. For
SegFormer's results on the segmentation datasets like ADE20k, refer to the [paper](https://arxiv.org/abs/2105.15203).
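
To make the tips above concrete, below is a minimal PyTorch inference sketch, assuming the
`nvidia/segformer-b0-finetuned-ade-512-512` checkpoint from the hub and an arbitrary test image:

```python
import requests
import torch
from PIL import Image
from transformers import SegformerForSemanticSegmentation, SegformerImageProcessor

# ADE20k fine-tuned checkpoint (MiT-b0 encoder + all-MLP decode head)
checkpoint = "nvidia/segformer-b0-finetuned-ade-512-512"
image_processor = SegformerImageProcessor.from_pretrained(checkpoint)
model = SegformerForSemanticSegmentation.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The logits come out at 1/4 of the input resolution; post-processing
# upsamples them to the original size and takes the per-pixel argmax
segmentation_map = image_processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]  # PIL size is (width, height), so reverse
)[0]
print(segmentation_map.shape)  # (height, width) map of predicted class indices
```

For training, one would additionally pass `segmentation_maps` to the image processor and, for ADE20k-style datasets,
enable the label reduction described above (depending on your `transformers` version, the flag is named
`reduce_labels` or `do_reduce_labels`).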

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with SegFormer.

<PipelineTag pipeline="image-classification"/>

- [`SegformerForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
- [Image classification task guide](../tasks/image_classification)
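
For a quick sanity check, a minimal classification sketch (assuming the ImageNet-1k pre-trained `nvidia/mit-b0`
checkpoint from the hub) could look as follows:

```python
import requests
import torch
from PIL import Image
from transformers import SegformerForImageClassification, SegformerImageProcessor

# MiT-b0 encoder pre-trained on ImageNet-1k, with a classification head on top
image_processor = SegformerImageProcessor.from_pretrained("nvidia/mit-b0")
model = SegformerForImageClassification.from_pretrained("nvidia/mit-b0")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```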

Semantic segmentation:

- [`SegformerForSemanticSegmentation`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/semantic-segmentation).
- A blog on fine-tuning SegFormer on a custom dataset can be found [here](https://huggingface.co/blog/fine-tune-segformer).
- More demo notebooks on SegFormer (both inference + fine-tuning on a custom dataset) can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/SegFormer).
- [`TFSegformerForSemanticSegmentation`] is supported by this [example notebook](https://github.com/huggingface/notebooks/blob/main/examples/semantic_segmentation-tf.ipynb).
- [Semantic segmentation task guide](../tasks/semantic_segmentation)
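
A minimal TensorFlow inference sketch, under the same assumptions as the PyTorch example above:

```python
import requests
import tensorflow as tf
from PIL import Image
from transformers import SegformerImageProcessor, TFSegformerForSemanticSegmentation

checkpoint = "nvidia/segformer-b0-finetuned-ade-512-512"
image_processor = SegformerImageProcessor.from_pretrained(checkpoint)
model = TFSegformerForSemanticSegmentation.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = image_processor(images=image, return_tensors="tf")
logits = model(**inputs).logits  # (batch, num_labels, height/4, width/4)

# Move channels last, upsample to the input size, then take the per-pixel argmax
logits = tf.transpose(logits, [0, 2, 3, 1])
upsampled = tf.image.resize(logits, image.size[::-1])  # size is (height, width)
segmentation_map = tf.argmax(upsampled, axis=-1)[0]
```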

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

## SegformerConfig

[[autodoc]] SegformerConfig

## SegformerFeatureExtractor

[[autodoc]] SegformerFeatureExtractor
    - __call__
    - post_process_semantic_segmentation

## SegformerImageProcessor

[[autodoc]] SegformerImageProcessor
    - preprocess
    - post_process_semantic_segmentation

<frameworkcontent>
<pt>

## SegformerModel

[[autodoc]] SegformerModel
    - forward

## SegformerDecodeHead

[[autodoc]] SegformerDecodeHead
    - forward

## SegformerForImageClassification

[[autodoc]] SegformerForImageClassification
    - forward

## SegformerForSemanticSegmentation

[[autodoc]] SegformerForSemanticSegmentation
    - forward

</pt>
<tf>

## TFSegformerDecodeHead

[[autodoc]] TFSegformerDecodeHead
    - call

## TFSegformerModel

[[autodoc]] TFSegformerModel
    - call

## TFSegformerForImageClassification

[[autodoc]] TFSegformerForImageClassification
    - call

## TFSegformerForSemanticSegmentation

[[autodoc]] TFSegformerForSemanticSegmentation
    - call

</tf>
</frameworkcontent>