<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# DPT

## Overview

The DPT model was proposed in [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun.
DPT is a model that leverages the [Vision Transformer (ViT)](vit) as a backbone for dense prediction tasks like semantic segmentation and depth estimation.

The abstract from the paper is the following:

*We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/dpt_architecture.jpg"
alt="drawing" width="600"/>

<small> DPT architecture. Taken from the <a href="https://arxiv.org/abs/2103.13413" target="_blank">original paper</a>. </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/isl-org/DPT).
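
To illustrate the depth estimation use case, the snippet below is a minimal inference sketch. It assumes the publicly available `Intel/dpt-large` checkpoint and an arbitrary RGB test image; any other DPT depth estimation checkpoint can be substituted.

```python
import torch
import requests
from PIL import Image
from transformers import DPTImageProcessor, DPTForDepthEstimation

# load an example image (any RGB image works)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# assumption: the Intel/dpt-large checkpoint (DPT with a ViT-L/16 backbone)
processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# the model predicts one relative depth value per pixel at the processed resolution;
# upsample back to the original image size for visualization
depth = torch.nn.functional.interpolate(
    outputs.predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
).squeeze()
```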

## Usage tips

DPT is compatible with the [`AutoBackbone`] class. This allows you to use the DPT framework with various computer vision backbones available in the library, such as [`VitDetBackbone`] or [`Dinov2Backbone`]. One can create it as follows:

```python
from transformers import Dinov2Config, DPTConfig, DPTForDepthEstimation

# initialize with a Transformer-based backbone such as DINOv2
# in that case, we also specify `reshape_hidden_states=False` to get feature maps of shape (batch_size, num_channels, height, width)
backbone_config = Dinov2Config.from_pretrained("facebook/dinov2-base", out_features=["stage1", "stage2", "stage3", "stage4"], reshape_hidden_states=False)

config = DPTConfig(backbone_config=backbone_config)
model = DPTForDepthEstimation(config=config)
```
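
For semantic segmentation, inference follows the same pattern. The sketch below assumes the publicly available `Intel/dpt-large-ade` checkpoint (DPT fine-tuned on ADE20k) and uses the image processor's `post_process_semantic_segmentation` method to recover a per-pixel class map at the original resolution.

```python
import requests
from PIL import Image
from transformers import DPTImageProcessor, DPTForSemanticSegmentation

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# assumption: the Intel/dpt-large-ade checkpoint, fine-tuned on ADE20k (150 classes)
processor = DPTImageProcessor.from_pretrained("Intel/dpt-large-ade")
model = DPTForSemanticSegmentation.from_pretrained("Intel/dpt-large-ade")

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# rescale the logits to the input image size and take the per-pixel argmax
segmentation = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]  # tensor of shape (height, width) with class indices
```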

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DPT.

- Demo notebooks for [`DPTForDepthEstimation`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DPT).

- [Semantic segmentation task guide](../tasks/semantic_segmentation)
- [Monocular depth estimation task guide](../tasks/monocular_depth_estimation)

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

## DPTConfig

[[autodoc]] DPTConfig

## DPTFeatureExtractor

[[autodoc]] DPTFeatureExtractor
    - __call__
    - post_process_semantic_segmentation

## DPTImageProcessor

[[autodoc]] DPTImageProcessor
    - preprocess
    - post_process_semantic_segmentation

## DPTModel

[[autodoc]] DPTModel
    - forward

## DPTForDepthEstimation

[[autodoc]] DPTForDepthEstimation
    - forward

## DPTForSemanticSegmentation

[[autodoc]] DPTForSemanticSegmentation
    - forward