Unverified Commit 0c55d47c authored by NielsRogge's avatar NielsRogge Committed by GitHub
Browse files

Add GLPN (#16199)



* First draft

* Fix logits calculation

* Improve tests

* Add copied from statements

* Fix base_model_prefix

* Improve implementation, upload new models

* Update design

* Fix integration test

* Add model to README and toctree

* Add document image

* Apply suggestions from code review

* Apply suggestions from code review
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Add decoder_hidden_size attribute

* Update design of decoder

* Add DepthEstimatorOutput class

* Rename in_index to head_in_index and add feature extractor tests

* Apply suggestions from code review

* Apply suggestions from code review

* Update pretrained model name and add to doc tests

* Remove test.py script

* Update copied from statements and clean up
Co-authored-by: default avatarNiels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent df32b5d8
......@@ -265,6 +265,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
1. **[GLPN](https://huggingface.co/docs/transformers/master/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
......
......@@ -244,6 +244,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
1. **[GLPN](https://huggingface.co/docs/transformers/master/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
......
......@@ -268,6 +268,7 @@ conda install -c huggingface transformers
1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (来自 CNRS) 伴随论文 [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) 由 Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab 发布。
1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (来自 Google Research) 伴随论文 [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) 由 James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon 发布。
1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (来自 CMU/Google Brain) 伴随论文 [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) 由 Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le 发布。
1. **[GLPN](https://huggingface.co/docs/transformers/master/model_doc/glpn)** (来自 KAIST) 伴随论文 [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) 由 Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim 发布。
1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (来自 OpenAI) 伴随论文 [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) 由 Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever 发布。
1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (来自 EleutherAI) 随仓库 [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) 发布。作者为 Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy 发布。
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (来自 OpenAI) 伴随论文 [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) 由 Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever** 发布。
......
......@@ -280,6 +280,7 @@ conda install -c huggingface transformers
1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
1. **[GLPN](https://huggingface.co/docs/transformers/master/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
......
......@@ -218,6 +218,8 @@
title: FSMT
- local: model_doc/funnel
title: Funnel Transformer
- local: model_doc/glpn
title: GLPN
- local: model_doc/herbert
title: HerBERT
- local: model_doc/ibert
......
......@@ -89,6 +89,7 @@ conversion utilities for the following models.
1. **[FlauBERT](model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
1. **[FNet](model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
1. **[Funnel Transformer](model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
1. **[GLPN](model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
1. **[GPT](model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
1. **[GPT-2](model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
......@@ -200,6 +201,7 @@ Flax), PyTorch, and/or TensorFlow.
| FlauBERT | ✅ | ❌ | ✅ | ✅ | ❌ |
| FNet | ✅ | ✅ | ✅ | ❌ | ❌ |
| Funnel Transformer | ✅ | ✅ | ✅ | ✅ | ❌ |
| GLPN | ❌ | ❌ | ✅ | ❌ | ❌ |
| GPT Neo | ❌ | ❌ | ✅ | ❌ | ✅ |
| GPT-J | ❌ | ❌ | ✅ | ❌ | ✅ |
| Hubert | ❌ | ❌ | ✅ | ✅ | ❌ |
......
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# GLPN
<Tip>
This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs or slight
breaking changes to fix it in the future. If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).
</Tip>
## Overview
The GLPN model was proposed in [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
GLPN combines [SegFormer](segformer)'s hierarchical mix-Transformer with a lightweight decoder for monocular depth estimation. The proposed decoder shows better performance than the previously proposed decoders, with considerably
less computational complexity.
The abstract from the paper is the following:
*Depth estimation from a single image is an important task that can be applied to various fields in computer vision, and has grown rapidly with the development of convolutional neural networks. In this paper, we propose a novel structure and training strategy for monocular depth estimation to further improve the prediction accuracy of the network. We deploy a hierarchical transformer encoder to capture and convey the global context, and design a lightweight yet powerful decoder to generate an estimated depth map while considering local connectivity. By constructing connected paths between multi-scale local features and the global decoding stream with our proposed selective feature fusion module, the network can integrate both representations and recover fine details. In addition, the proposed decoder shows better performance than the previously proposed decoders, with considerably less computational complexity. Furthermore, we improve the depth-specific augmentation method by utilizing an important observation in depth estimation to enhance the model. Our network achieves state-of-the-art performance over the challenging depth dataset NYU Depth V2. Extensive experiments have been conducted to validate and show the effectiveness of the proposed approach. Finally, our model shows better generalisation ability and robustness than other comparative models.*
Tips:
- One can use [`GLPNFeatureExtractor`] to prepare images for the model.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/glpn_architecture.jpg"
alt="drawing" width="600"/>
<small> Summary of the approach. Taken from the <a href="https://arxiv.org/abs/2201.07436" target="_blank">original paper</a>. </small>
This model was contributed by [niels](<https://huggingface.co/nielsr). The original code can be found [here](https://github.com/vinvino02/GLPDepth).
## GLPNConfig
[[autodoc]] GLPNConfig
## GLPNFeatureExtractor
[[autodoc]] GLPNFeatureExtractor
- __call__
## GLPNModel
[[autodoc]] GLPNModel
- forward
## GLPNForDepthEstimation
[[autodoc]] GLPNForDepthEstimation
- forward
\ No newline at end of file
......@@ -224,6 +224,7 @@ _import_structure = {
"models.fnet": ["FNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "FNetConfig", "FNetTokenizer"],
"models.fsmt": ["FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP", "FSMTConfig", "FSMTTokenizer"],
"models.funnel": ["FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP", "FunnelConfig", "FunnelTokenizer"],
"models.glpn": ["GLPN_PRETRAINED_CONFIG_ARCHIVE_MAP", "GLPNConfig"],
"models.gpt2": ["GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPT2Config", "GPT2Tokenizer"],
"models.gpt_neo": ["GPT_NEO_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTNeoConfig"],
"models.gptj": ["GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTJConfig"],
......@@ -525,6 +526,7 @@ if is_vision_available():
_import_structure["models.convnext"].append("ConvNextFeatureExtractor")
_import_structure["models.deit"].append("DeiTFeatureExtractor")
_import_structure["models.detr"].append("DetrFeatureExtractor")
_import_structure["models.glpn"].append("GLPNFeatureExtractor")
_import_structure["models.imagegpt"].append("ImageGPTFeatureExtractor")
_import_structure["models.layoutlmv2"].append("LayoutLMv2FeatureExtractor")
_import_structure["models.layoutlmv2"].append("LayoutLMv2Processor")
......@@ -993,6 +995,14 @@ if is_torch_available():
"load_tf_weights_in_funnel",
]
)
_import_structure["models.glpn"].extend(
[
"GLPN_PRETRAINED_MODEL_ARCHIVE_LIST",
"GLPNForDepthEstimation",
"GLPNModel",
"GLPNPreTrainedModel",
]
)
_import_structure["models.gpt2"].extend(
[
"GPT2_PRETRAINED_MODEL_ARCHIVE_LIST",
......@@ -2550,6 +2560,7 @@ if TYPE_CHECKING:
from .models.fnet import FNET_PRETRAINED_CONFIG_ARCHIVE_MAP, FNetConfig, FNetTokenizer
from .models.fsmt import FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP, FSMTConfig, FSMTTokenizer
from .models.funnel import FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP, FunnelConfig, FunnelTokenizer
from .models.glpn import GLPN_PRETRAINED_CONFIG_ARCHIVE_MAP, GLPNConfig
from .models.gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config, GPT2Tokenizer
from .models.gpt_neo import GPT_NEO_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTNeoConfig
from .models.gptj import GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTJConfig
......@@ -2803,6 +2814,7 @@ if TYPE_CHECKING:
from .models.convnext import ConvNextFeatureExtractor
from .models.deit import DeiTFeatureExtractor
from .models.detr import DetrFeatureExtractor
from .models.glpn import GLPNFeatureExtractor
from .models.imagegpt import ImageGPTFeatureExtractor
from .models.layoutlmv2 import LayoutLMv2FeatureExtractor, LayoutLMv2Processor
from .models.layoutxlm import LayoutXLMProcessor
......@@ -2841,6 +2853,7 @@ if TYPE_CHECKING:
from .utils.dummy_scatter_objects import *
if is_torch_available():
# Benchmarks
from .benchmark.benchmark import PyTorchBenchmark
from .benchmark.benchmark_args import PyTorchBenchmarkArguments
......@@ -3195,6 +3208,12 @@ if TYPE_CHECKING:
FunnelPreTrainedModel,
load_tf_weights_in_funnel,
)
from .models.glpn import (
GLPN_PRETRAINED_MODEL_ARCHIVE_LIST,
GLPNForDepthEstimation,
GLPNModel,
GLPNPreTrainedModel,
)
from .models.gpt2 import (
GPT2_PRETRAINED_MODEL_ARCHIVE_LIST,
GPT2DoubleHeadsModel,
......
......@@ -878,3 +878,33 @@ class ImageClassifierOutput(ModelOutput):
logits: torch.FloatTensor = None
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
attentions: Optional[Tuple[torch.FloatTensor]] = None
@dataclass
class DepthEstimatorOutput(ModelOutput):
"""
Base class for outputs of depth estimation models.
Args:
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Classification (or regression if config.num_labels==1) loss.
predicted_depth (`torch.FloatTensor` of shape `(batch_size, height, width)`):
Predicted depth for each pixel.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, num_channels, height, width)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, patch_size,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
"""
loss: Optional[torch.FloatTensor] = None
predicted_depth: torch.FloatTensor = None
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
attentions: Optional[Tuple[torch.FloatTensor]] = None
......@@ -55,6 +55,7 @@ from . import (
fnet,
fsmt,
funnel,
glpn,
gpt2,
gpt_neo,
gptj,
......
......@@ -30,6 +30,7 @@ logger = logging.get_logger(__name__)
CONFIG_MAPPING_NAMES = OrderedDict(
[
# Add configs here
("glpn", "GLPNConfig"),
("maskformer", "MaskFormerConfig"),
("poolformer", "PoolFormerConfig"),
("convnext", "ConvNextConfig"),
......@@ -132,6 +133,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
[
# Add archive maps here
("glpn", "GLPN_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("maskformer", "MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("poolformer", "POOLFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("convnext", "CONVNEXT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
......@@ -221,6 +223,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
MODEL_NAMES_MAPPING = OrderedDict(
[
# Add full (and cased) model names here
("glpn", "GLPN"),
("maskformer", "MaskFormer"),
("poolformer", "PoolFormer"),
("convnext", "ConvNext"),
......
......@@ -28,6 +28,7 @@ logger = logging.get_logger(__name__)
MODEL_MAPPING_NAMES = OrderedDict(
[
# Base model mapping
("glpn", "GLPNModel"),
("maskformer", "MaskFormerModel"),
("poolformer", "PoolFormerModel"),
("convnext", "ConvNextModel"),
......
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
# rely on isort to merge the imports
from ...file_utils import _LazyModule, is_torch_available, is_vision_available
_import_structure = {
"configuration_glpn": ["GLPN_PRETRAINED_CONFIG_ARCHIVE_MAP", "GLPNConfig"],
}
if is_vision_available():
_import_structure["feature_extraction_glpn"] = ["GLPNFeatureExtractor"]
if is_torch_available():
_import_structure["modeling_glpn"] = [
"GLPN_PRETRAINED_MODEL_ARCHIVE_LIST",
"GLPNForDepthEstimation",
"GLPNLayer",
"GLPNModel",
"GLPNPreTrainedModel",
]
if TYPE_CHECKING:
from .configuration_glpn import GLPN_PRETRAINED_CONFIG_ARCHIVE_MAP, GLPNConfig
if is_vision_available():
from .feature_extraction_glpn import GLPNFeatureExtractor
if is_torch_available():
from .modeling_glpn import (
GLPN_PRETRAINED_MODEL_ARCHIVE_LIST,
GLPNForDepthEstimation,
GLPNLayer,
GLPNModel,
GLPNPreTrainedModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
# coding=utf-8
# Copyright 2022 KAIST and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" GLPN model configuration"""
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
GLPN_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"vinvino02/glpn-kitti": "https://huggingface.co/vinvino02/gdpdepth-kitti/resolve/main/config.json",
# See all GLPN models at https://huggingface.co/models?filter=gdpdepth
}
class GLPNConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`GLPNModel`]. It is used to instantiate an GLPN
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the GLPN
[kaist/gdpdepth-kitti](https://huggingface.co/kaist/gdpdepth-kitti) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
num_channels (`int`, *optional*, defaults to 3):
The number of input channels.
num_encoder_blocks (`int`, *optional*, defaults to 4):
The number of encoder blocks (i.e. stages in the Mix Transformer encoder).
depths (`List[int]`, *optional*, defaults to `[2, 2, 2, 2]`):
The number of layers in each encoder block.
sr_ratios (`List[int]`, *optional*, defaults to `[8, 4, 2, 1]`):
Sequence reduction ratios in each encoder block.
hidden_sizes (`List[int]`, *optional*, defaults to `[32, 64, 160, 256]`):
Dimension of each of the encoder blocks.
patch_sizes (`List[int]`, *optional*, defaults to `[7, 3, 3, 3]`):
Patch size before each encoder block.
strides (`List[int]`, *optional*, defaults to `[4, 2, 2, 2]`):
Stride before each encoder block.
num_attention_heads (`List[int]`, *optional*, defaults to `[1, 2, 4, 8]`):
Number of attention heads for each attention layer in each block of the Transformer encoder.
mlp_ratios (`List[int]`, *optional*, defaults to `[4, 4, 4, 4]`):
Ratio of the size of the hidden layer compared to the size of the input layer of the Mix FFNs in the
encoder blocks.
hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"selu"` and `"gelu_new"` are supported.
hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
drop_path_rate (`float`, *optional*, defaults to 0.1):
The dropout probability for stochastic depth, used in the blocks of the Transformer encoder.
layer_norm_eps (`float`, *optional*, defaults to 1e-6):
The epsilon used by the layer normalization layers.
decoder_hidden_size (`int`, *optional*, defaults to 32):
The dimension of the decoder.
max_depth (`int`, *optional*, defaults to 10):
The maximum depth of the decoder.
head_in_index (`int`, *optional*, defaults to -1):
The index of the features to use in the head.
Example:
```python
>>> from transformers import GLPNModel, GLPNConfig
>>> # Initializing a GLPN kaist/gdpdepth-kitti style configuration
>>> configuration = GLPNConfig()
>>> # Initializing a model from the kaist/gdpdepth-kitti style configuration
>>> model = GLPNModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "glpn"
def __init__(
self,
num_channels=3,
num_encoder_blocks=4,
depths=[2, 2, 2, 2],
sr_ratios=[8, 4, 2, 1],
hidden_sizes=[32, 64, 160, 256],
patch_sizes=[7, 3, 3, 3],
strides=[4, 2, 2, 2],
num_attention_heads=[1, 2, 5, 8],
mlp_ratios=[4, 4, 4, 4],
hidden_act="gelu",
hidden_dropout_prob=0.0,
attention_probs_dropout_prob=0.0,
initializer_range=0.02,
drop_path_rate=0.1,
layer_norm_eps=1e-6,
is_encoder_decoder=False,
decoder_hidden_size=64,
max_depth=10,
head_in_index=-1,
**kwargs
):
super().__init__(**kwargs)
self.num_channels = num_channels
self.num_encoder_blocks = num_encoder_blocks
self.depths = depths
self.sr_ratios = sr_ratios
self.hidden_sizes = hidden_sizes
self.patch_sizes = patch_sizes
self.strides = strides
self.mlp_ratios = mlp_ratios
self.num_attention_heads = num_attention_heads
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.initializer_range = initializer_range
self.drop_path_rate = drop_path_rate
self.layer_norm_eps = layer_norm_eps
self.decoder_hidden_size = decoder_hidden_size
self.max_depth = max_depth
self.head_in_index = head_in_index
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert GLPN checkpoints."""
import argparse
from collections import OrderedDict
from pathlib import Path
import torch
from PIL import Image
import requests
from transformers import GLPNConfig, GLPNFeatureExtractor, GLPNForDepthEstimation
from transformers.utils import logging
logging.set_verbosity_info()
logger = logging.get_logger(__name__)
def rename_keys(state_dict):
new_state_dict = OrderedDict()
for key, value in state_dict.items():
if key.startswith("module.encoder"):
key = key.replace("module.encoder", "glpn.encoder")
if key.startswith("module.decoder"):
key = key.replace("module.decoder", "decoder.stages")
if "patch_embed" in key:
# replace for example patch_embed1 by patch_embeddings.0
idx = key[key.find("patch_embed") + len("patch_embed")]
key = key.replace(f"patch_embed{idx}", f"patch_embeddings.{int(idx)-1}")
if "norm" in key:
key = key.replace("norm", "layer_norm")
if "glpn.encoder.layer_norm" in key:
# replace for example layer_norm1 by layer_norm.0
idx = key[key.find("glpn.encoder.layer_norm") + len("glpn.encoder.layer_norm")]
key = key.replace(f"layer_norm{idx}", f"layer_norm.{int(idx)-1}")
if "layer_norm1" in key:
key = key.replace("layer_norm1", "layer_norm_1")
if "layer_norm2" in key:
key = key.replace("layer_norm2", "layer_norm_2")
if "block" in key:
# replace for example block1 by block.0
idx = key[key.find("block") + len("block")]
key = key.replace(f"block{idx}", f"block.{int(idx)-1}")
if "attn.q" in key:
key = key.replace("attn.q", "attention.self.query")
if "attn.proj" in key:
key = key.replace("attn.proj", "attention.output.dense")
if "attn" in key:
key = key.replace("attn", "attention.self")
if "fc1" in key:
key = key.replace("fc1", "dense1")
if "fc2" in key:
key = key.replace("fc2", "dense2")
if "linear_pred" in key:
key = key.replace("linear_pred", "classifier")
if "linear_fuse" in key:
key = key.replace("linear_fuse.conv", "linear_fuse")
key = key.replace("linear_fuse.bn", "batch_norm")
if "linear_c" in key:
# replace for example linear_c4 by linear_c.3
idx = key[key.find("linear_c") + len("linear_c")]
key = key.replace(f"linear_c{idx}", f"linear_c.{int(idx)-1}")
if "bot_conv" in key:
key = key.replace("bot_conv", "0.convolution")
if "skip_conv1" in key:
key = key.replace("skip_conv1", "1.convolution")
if "skip_conv2" in key:
key = key.replace("skip_conv2", "2.convolution")
if "fusion1" in key:
key = key.replace("fusion1", "1.fusion")
if "fusion2" in key:
key = key.replace("fusion2", "2.fusion")
if "fusion3" in key:
key = key.replace("fusion3", "3.fusion")
if "fusion" in key and "conv" in key:
key = key.replace("conv", "convolutional_layer")
if key.startswith("module.last_layer_depth"):
key = key.replace("module.last_layer_depth", "head.head")
new_state_dict[key] = value
return new_state_dict
def read_in_k_v(state_dict, config):
# for each of the encoder blocks:
for i in range(config.num_encoder_blocks):
for j in range(config.depths[i]):
# read in weights + bias of keys and values (which is a single matrix in the original implementation)
kv_weight = state_dict.pop(f"glpn.encoder.block.{i}.{j}.attention.self.kv.weight")
kv_bias = state_dict.pop(f"glpn.encoder.block.{i}.{j}.attention.self.kv.bias")
# next, add keys and values (in that order) to the state dict
state_dict[f"glpn.encoder.block.{i}.{j}.attention.self.key.weight"] = kv_weight[
: config.hidden_sizes[i], :
]
state_dict[f"glpn.encoder.block.{i}.{j}.attention.self.key.bias"] = kv_bias[: config.hidden_sizes[i]]
state_dict[f"glpn.encoder.block.{i}.{j}.attention.self.value.weight"] = kv_weight[
config.hidden_sizes[i] :, :
]
state_dict[f"glpn.encoder.block.{i}.{j}.attention.self.value.bias"] = kv_bias[config.hidden_sizes[i] :]
# We will verify our results on a COCO image
def prepare_img():
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
return image
@torch.no_grad()
def convert_glpn_checkpoint(checkpoint_path, pytorch_dump_folder_path, push_to_hub=False, model_name=None):
"""
Copy/paste/tweak model's weights to our GLPN structure.
"""
# load GLPN configuration (Segformer-B4 size)
config = GLPNConfig(hidden_sizes=[64, 128, 320, 512], decoder_hidden_size=64, depths=[3, 8, 27, 3])
# load feature extractor (only resize + rescale)
feature_extractor = GLPNFeatureExtractor()
# prepare image
image = prepare_img()
pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
logger.info("Converting model...")
# load original state dict
state_dict = torch.load(checkpoint_path, map_location=torch.device("cpu"))
# rename keys
state_dict = rename_keys(state_dict)
# key and value matrices need special treatment
read_in_k_v(state_dict, config)
# create HuggingFace model and load state dict
model = GLPNForDepthEstimation(config)
model.load_state_dict(state_dict)
model.eval()
# forward pass
outputs = model(pixel_values)
predicted_depth = outputs.predicted_depth
# verify output
if model_name is not None:
if "nyu" in model_name:
expected_slice = torch.tensor(
[[4.4147, 4.0873, 4.0673], [3.7890, 3.2881, 3.1525], [3.7674, 3.5423, 3.4913]]
)
elif "kitti" in model_name:
expected_slice = torch.tensor(
[[3.4291, 2.7865, 2.5151], [3.2841, 2.7021, 2.3502], [3.1147, 2.4625, 2.2481]]
)
else:
raise ValueError(f"Unknown model name: {model_name}")
expected_shape = torch.Size([1, 480, 640])
assert predicted_depth.shape == expected_shape
assert torch.allclose(predicted_depth[0, :3, :3], expected_slice, atol=1e-4)
print("Looks ok!")
# finally, push to hub if required
if push_to_hub:
logger.info("Pushing model and feature extractor to the hub...")
model.push_to_hub(
repo_path_or_name=Path(pytorch_dump_folder_path, model_name),
organization="nielsr",
commit_message="Add model",
use_temp_dir=True,
)
feature_extractor.push_to_hub(
repo_path_or_name=Path(pytorch_dump_folder_path, model_name),
organization="nielsr",
commit_message="Add feature extractor",
use_temp_dir=True,
)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--checkpoint_path",
default=None,
type=str,
help="Path to the original PyTorch checkpoint (.pth file).",
)
parser.add_argument(
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the folder to output PyTorch model."
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether to upload the model to the HuggingFace hub."
)
parser.add_argument(
"--model_name",
default="glpn-kitti",
type=str,
help="Name of the model in case you're pushing to the hub.",
)
args = parser.parse_args()
convert_glpn_checkpoint(args.checkpoint_path, args.pytorch_dump_folder_path, args.push_to_hub, args.model_name)
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Feature extractor class for GLPN."""
from typing import Optional, Union
import numpy as np
from PIL import Image
from ...feature_extraction_utils import BatchFeature, FeatureExtractionMixin
from ...file_utils import TensorType
from ...image_utils import ImageFeatureExtractionMixin, ImageInput, is_torch_tensor
from ...utils import logging
logger = logging.get_logger(__name__)
class GLPNFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
r"""
Constructs a GLPN feature extractor.
This feature extractor inherits from [`FeatureExtractionMixin`] which contains most of the main methods. Users
should refer to this superclass for more information regarding those methods.
Args:
do_resize (`bool`, *optional*, defaults to `True`):
Whether to resize the input based on certain `size_divisor`.
size_divisor (`int` or `Tuple(int)`, *optional*, defaults to 32):
Make sure the input is divisible by this value. Only has an effect if `do_resize` is set to `True`.
resample (`int`, *optional*, defaults to `PIL.Image.BILINEAR`):
An optional resampling filter. This can be one of `PIL.Image.NEAREST`, `PIL.Image.BOX`,
`PIL.Image.BILINEAR`, `PIL.Image.HAMMING`, `PIL.Image.BICUBIC` or `PIL.Image.LANCZOS`. Only has an effect
if `do_resize` is set to `True`.
do_rescale (`bool`, *optional*, defaults to `True`):
Whether or not to apply the scaling factor (to make pixel values floats between 0. and 1.).
"""
model_input_names = ["pixel_values"]
def __init__(self, do_resize=True, size_divisor=32, resample=Image.BILINEAR, do_rescale=True, **kwargs):
super().__init__(**kwargs)
self.do_resize = do_resize
self.size_divisor = size_divisor
self.resample = resample
self.do_rescale = do_rescale
def _resize(self, image, size_divisor, resample):
if not isinstance(image, Image.Image):
image = self.to_pil_image(image)
width, height = image.size
new_h, new_w = height // size_divisor * size_divisor, width // size_divisor * size_divisor
image = self.resize(image, size=(new_w, new_h), resample=resample)
return image
def __call__(
self, images: ImageInput, return_tensors: Optional[Union[str, TensorType]] = None, **kwargs
) -> BatchFeature:
"""
Main method to prepare for the model one or several image(s).
<Tip warning={true}>
NumPy arrays and PyTorch tensors are converted to PIL images when resizing, so the most efficient is to pass
PIL images.
</Tip>
Args:
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
number of channels, H and W are image height and width.
return_tensors (`str` or [`~file_utils.TensorType`], *optional*, defaults to `'np'`):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **pixel_values** -- Pixel values to be fed to a model, of shape (batch_size, num_channels, height,
width).
"""
# Input type checking for clearer error
valid_images = False
# Check that images has a valid type
if isinstance(images, (Image.Image, np.ndarray)) or is_torch_tensor(images):
valid_images = True
elif isinstance(images, (list, tuple)):
if len(images) == 0 or isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]):
valid_images = True
if not valid_images:
raise ValueError(
"Images must of type `PIL.Image.Image`, `np.ndarray` or `torch.Tensor` (single example), "
"`List[PIL.Image.Image]`, `List[np.ndarray]` or `List[torch.Tensor]` (batch of examples)."
)
is_batched = bool(
isinstance(images, (list, tuple))
and (isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]))
)
if not is_batched:
images = [images]
# transformations (resizing + rescaling)
if self.do_resize and self.size_divisor is not None:
images = [
self._resize(image=image, size_divisor=self.size_divisor, resample=self.resample) for image in images
]
if self.do_rescale:
images = [self.to_numpy_array(image=image) for image in images]
# return as BatchFeature
data = {"pixel_values": images}
encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
return encoded_inputs
# coding=utf-8
# Copyright 2022 KAIST and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch GLPN model."""
import math
from typing import List, Optional, Tuple, Union
import torch
import torch.utils.checkpoint
from torch import nn
from ...activations import ACT2FN
from ...file_utils import (
add_code_sample_docstrings,
add_start_docstrings,
add_start_docstrings_to_model_forward,
replace_return_docstrings,
)
from ...modeling_outputs import BaseModelOutput, DepthEstimatorOutput
from ...modeling_utils import PreTrainedModel, find_pruneable_heads_and_indices, prune_linear_layer
from ...utils import logging
from .configuration_glpn import GLPNConfig
logger = logging.get_logger(__name__)
# General docstring
_CONFIG_FOR_DOC = "GLPNConfig"
_FEAT_EXTRACTOR_FOR_DOC = "GLPNFeatureExtractor"
# Base docstring
_CHECKPOINT_FOR_DOC = "vinvino02/glpn-kitti"
_EXPECTED_OUTPUT_SHAPE = [1, 512, 15, 20]
GLPN_PRETRAINED_MODEL_ARCHIVE_LIST = [
"vinvino02/glpn-kitti",
# See all GLPN models at https://huggingface.co/models?filter=glpn
]
# Copied from transformers.models.segformer.modeling_segformer.drop_path
def drop_path(x, drop_prob: float = 0.0, training: bool = False):
"""
Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks). This is the same as the
DropConnect impl I created for EfficientNet, etc networks, however, the original name is misleading as 'Drop
Connect' is a different form of dropout in a separate paper... See discussion:
https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for changing the layer and
argument names to 'drop path' rather than mix DropConnect as a layer name and use 'survival rate' as the argument.
"""
if drop_prob == 0.0 or not training:
return x
keep_prob = 1 - drop_prob
shape = (x.shape[0],) + (1,) * (x.ndim - 1) # work with diff dim tensors, not just 2D ConvNets
random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
random_tensor.floor_() # binarize
output = x.div(keep_prob) * random_tensor
return output
# Copied from transformers.models.segformer.modeling_segformer.SegformerDropPath
class GLPNDropPath(nn.Module):
"""Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks)."""
def __init__(self, drop_prob=None):
super().__init__()
self.drop_prob = drop_prob
def forward(self, x):
return drop_path(x, self.drop_prob, self.training)
# Copied from transformers.models.segformer.modeling_segformer.SegformerOverlapPatchEmbeddings
class GLPNOverlapPatchEmbeddings(nn.Module):
"""Construct the overlapping patch embeddings."""
def __init__(self, patch_size, stride, num_channels, hidden_size):
super().__init__()
self.proj = nn.Conv2d(
num_channels,
hidden_size,
kernel_size=patch_size,
stride=stride,
padding=patch_size // 2,
)
self.layer_norm = nn.LayerNorm(hidden_size)
def forward(self, pixel_values):
embeddings = self.proj(pixel_values)
_, _, height, width = embeddings.shape
# (batch_size, num_channels, height, width) -> (batch_size, num_channels, height*width) -> (batch_size, height*width, num_channels)
# this can be fed to a Transformer layer
embeddings = embeddings.flatten(2).transpose(1, 2)
embeddings = self.layer_norm(embeddings)
return embeddings, height, width
# Copied from transformers.models.segformer.modeling_segformer.SegformerEfficientSelfAttention
class GLPNEfficientSelfAttention(nn.Module):
"""SegFormer's efficient self-attention mechanism. Employs the sequence reduction process introduced in the [PvT
paper](https://arxiv.org/abs/2102.12122)."""
def __init__(self, config, hidden_size, num_attention_heads, sequence_reduction_ratio):
super().__init__()
self.hidden_size = hidden_size
self.num_attention_heads = num_attention_heads
if self.hidden_size % self.num_attention_heads != 0:
raise ValueError(
f"The hidden size ({self.hidden_size}) is not a multiple of the number of attention "
f"heads ({self.num_attention_heads})"
)
self.attention_head_size = int(self.hidden_size / self.num_attention_heads)
self.all_head_size = self.num_attention_heads * self.attention_head_size
self.query = nn.Linear(self.hidden_size, self.all_head_size)
self.key = nn.Linear(self.hidden_size, self.all_head_size)
self.value = nn.Linear(self.hidden_size, self.all_head_size)
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
self.sr_ratio = sequence_reduction_ratio
if sequence_reduction_ratio > 1:
self.sr = nn.Conv2d(
hidden_size, hidden_size, kernel_size=sequence_reduction_ratio, stride=sequence_reduction_ratio
)
self.layer_norm = nn.LayerNorm(hidden_size)
def transpose_for_scores(self, hidden_states):
new_shape = hidden_states.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
hidden_states = hidden_states.view(*new_shape)
return hidden_states.permute(0, 2, 1, 3)
def forward(
self,
hidden_states,
height,
width,
output_attentions=False,
):
query_layer = self.transpose_for_scores(self.query(hidden_states))
if self.sr_ratio > 1:
batch_size, seq_len, num_channels = hidden_states.shape
# Reshape to (batch_size, num_channels, height, width)
hidden_states = hidden_states.permute(0, 2, 1).reshape(batch_size, num_channels, height, width)
# Apply sequence reduction
hidden_states = self.sr(hidden_states)
# Reshape back to (batch_size, seq_len, num_channels)
hidden_states = hidden_states.reshape(batch_size, num_channels, -1).permute(0, 2, 1)
hidden_states = self.layer_norm(hidden_states)
key_layer = self.transpose_for_scores(self.key(hidden_states))
value_layer = self.transpose_for_scores(self.value(hidden_states))
# Take the dot product between "query" and "key" to get the raw attention scores.
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
# Normalize the attention scores to probabilities.
attention_probs = nn.functional.softmax(attention_scores, dim=-1)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = self.dropout(attention_probs)
context_layer = torch.matmul(attention_probs, value_layer)
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
context_layer = context_layer.view(*new_context_layer_shape)
outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
return outputs
# Copied from transformers.models.segformer.modeling_segformer.SegformerSelfOutput
class GLPNSelfOutput(nn.Module):
def __init__(self, config, hidden_size):
super().__init__()
self.dense = nn.Linear(hidden_size, hidden_size)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
def forward(self, hidden_states, input_tensor):
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
return hidden_states
# Copied from transformers.models.segformer.modeling_segformer.SegformerAttention with Segformer->GLPN
class GLPNAttention(nn.Module):
def __init__(self, config, hidden_size, num_attention_heads, sequence_reduction_ratio):
super().__init__()
self.self = GLPNEfficientSelfAttention(
config=config,
hidden_size=hidden_size,
num_attention_heads=num_attention_heads,
sequence_reduction_ratio=sequence_reduction_ratio,
)
self.output = GLPNSelfOutput(config, hidden_size=hidden_size)
self.pruned_heads = set()
def prune_heads(self, heads):
if len(heads) == 0:
return
heads, index = find_pruneable_heads_and_indices(
heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads
)
# Prune linear layers
self.self.query = prune_linear_layer(self.self.query, index)
self.self.key = prune_linear_layer(self.self.key, index)
self.self.value = prune_linear_layer(self.self.value, index)
self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
# Update hyper params and store pruned heads
self.self.num_attention_heads = self.self.num_attention_heads - len(heads)
self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads
self.pruned_heads = self.pruned_heads.union(heads)
def forward(self, hidden_states, height, width, output_attentions=False):
self_outputs = self.self(hidden_states, height, width, output_attentions)
attention_output = self.output(self_outputs[0], hidden_states)
outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them
return outputs
# Copied from transformers.models.segformer.modeling_segformer.SegformerDWConv
class GLPNDWConv(nn.Module):
def __init__(self, dim=768):
super().__init__()
self.dwconv = nn.Conv2d(dim, dim, 3, 1, 1, bias=True, groups=dim)
def forward(self, hidden_states, height, width):
batch_size, seq_len, num_channels = hidden_states.shape
hidden_states = hidden_states.transpose(1, 2).view(batch_size, num_channels, height, width)
hidden_states = self.dwconv(hidden_states)
hidden_states = hidden_states.flatten(2).transpose(1, 2)
return hidden_states
# Copied from transformers.models.segformer.modeling_segformer.SegformerMixFFN with Segformer->GLPN
class GLPNMixFFN(nn.Module):
def __init__(self, config, in_features, hidden_features=None, out_features=None):
super().__init__()
out_features = out_features or in_features
self.dense1 = nn.Linear(in_features, hidden_features)
self.dwconv = GLPNDWConv(hidden_features)
if isinstance(config.hidden_act, str):
self.intermediate_act_fn = ACT2FN[config.hidden_act]
else:
self.intermediate_act_fn = config.hidden_act
self.dense2 = nn.Linear(hidden_features, out_features)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
def forward(self, hidden_states, height, width):
hidden_states = self.dense1(hidden_states)
hidden_states = self.dwconv(hidden_states, height, width)
hidden_states = self.intermediate_act_fn(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.dense2(hidden_states)
hidden_states = self.dropout(hidden_states)
return hidden_states
# Copied from transformers.models.segformer.modeling_segformer.SegformerLayer with Segformer->GLPN
class GLPNLayer(nn.Module):
"""This corresponds to the Block class in the original implementation."""
def __init__(self, config, hidden_size, num_attention_heads, drop_path, sequence_reduction_ratio, mlp_ratio):
super().__init__()
self.layer_norm_1 = nn.LayerNorm(hidden_size)
self.attention = GLPNAttention(
config,
hidden_size=hidden_size,
num_attention_heads=num_attention_heads,
sequence_reduction_ratio=sequence_reduction_ratio,
)
self.drop_path = GLPNDropPath(drop_path) if drop_path > 0.0 else nn.Identity()
self.layer_norm_2 = nn.LayerNorm(hidden_size)
mlp_hidden_size = int(hidden_size * mlp_ratio)
self.mlp = GLPNMixFFN(config, in_features=hidden_size, hidden_features=mlp_hidden_size)
def forward(self, hidden_states, height, width, output_attentions=False):
self_attention_outputs = self.attention(
self.layer_norm_1(hidden_states), # in GLPN, layernorm is applied before self-attention
height,
width,
output_attentions=output_attentions,
)
attention_output = self_attention_outputs[0]
outputs = self_attention_outputs[1:] # add self attentions if we output attention weights
# first residual connection (with stochastic depth)
attention_output = self.drop_path(attention_output)
hidden_states = attention_output + hidden_states
mlp_output = self.mlp(self.layer_norm_2(hidden_states), height, width)
# second residual connection (with stochastic depth)
mlp_output = self.drop_path(mlp_output)
layer_output = mlp_output + hidden_states
outputs = (layer_output,) + outputs
return outputs
class GLPNEncoder(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
# stochastic depth decay rule
dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
# patch embeddings
embeddings = []
for i in range(config.num_encoder_blocks):
embeddings.append(
GLPNOverlapPatchEmbeddings(
patch_size=config.patch_sizes[i],
stride=config.strides[i],
num_channels=config.num_channels if i == 0 else config.hidden_sizes[i - 1],
hidden_size=config.hidden_sizes[i],
)
)
self.patch_embeddings = nn.ModuleList(embeddings)
# Transformer blocks
blocks = []
cur = 0
for i in range(config.num_encoder_blocks):
# each block consists of layers
layers = []
if i != 0:
cur += config.depths[i - 1]
for j in range(config.depths[i]):
layers.append(
GLPNLayer(
config,
hidden_size=config.hidden_sizes[i],
num_attention_heads=config.num_attention_heads[i],
drop_path=dpr[cur + j],
sequence_reduction_ratio=config.sr_ratios[i],
mlp_ratio=config.mlp_ratios[i],
)
)
blocks.append(nn.ModuleList(layers))
self.block = nn.ModuleList(blocks)
# Layer norms
self.layer_norm = nn.ModuleList(
[nn.LayerNorm(config.hidden_sizes[i]) for i in range(config.num_encoder_blocks)]
)
def forward(
self,
pixel_values,
output_attentions=False,
output_hidden_states=False,
return_dict=True,
):
all_hidden_states = () if output_hidden_states else None
all_self_attentions = () if output_attentions else None
batch_size = pixel_values.shape[0]
hidden_states = pixel_values
for idx, x in enumerate(zip(self.patch_embeddings, self.block, self.layer_norm)):
embedding_layer, block_layer, norm_layer = x
# first, obtain patch embeddings
hidden_states, height, width = embedding_layer(hidden_states)
# second, send embeddings through blocks
for i, blk in enumerate(block_layer):
layer_outputs = blk(hidden_states, height, width, output_attentions)
hidden_states = layer_outputs[0]
if output_attentions:
all_self_attentions = all_self_attentions + (layer_outputs[1],)
# third, apply layer norm
hidden_states = norm_layer(hidden_states)
# fourth, optionally reshape back to (batch_size, num_channels, height, width)
hidden_states = hidden_states.reshape(batch_size, height, width, -1).permute(0, 3, 1, 2).contiguous()
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
if not return_dict:
return tuple(v for v in [hidden_states, all_hidden_states, all_self_attentions] if v is not None)
return BaseModelOutput(
last_hidden_state=hidden_states,
hidden_states=all_hidden_states,
attentions=all_self_attentions,
)
class GLPNPreTrainedModel(PreTrainedModel):
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = GLPNConfig
base_model_prefix = "glpn"
main_input_name = "pixel_values"
# Copied from transformers.models.segformer.modeling_segformer.SegformerPreTrainedModel._init_weights
def _init_weights(self, module):
"""Initialize the weights"""
if isinstance(module, (nn.Linear, nn.Conv2d)):
# Slightly different from the TF version which uses truncated_normal for initialization
# cf https://github.com/pytorch/pytorch/pull/5617
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
GLPN_START_DOCSTRING = r"""
This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) sub-class. Use
it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and
behavior.
Parameters:
config ([`GLPNConfig`]): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
GLPN_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using
[`GLPNFeatureExtractor`]. See [`GLPNFeatureExtractor.__call__`] for details.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~file_utils.ModelOutput`] instead of a plain tuple.
"""
@add_start_docstrings(
"The bare GLPN encoder (Mix-Transformer) outputting raw hidden-states without any specific head on top.",
GLPN_START_DOCSTRING,
)
class GLPNModel(GLPNPreTrainedModel):
# Copied from transformers.models.segformer.modeling_segformer.SegformerModel.__init__ with Segformer->GLPN
def __init__(self, config):
super().__init__(config)
self.config = config
# hierarchical Transformer encoder
self.encoder = GLPNEncoder(config)
# Initialize weights and apply final processing
self.post_init()
def _prune_heads(self, heads_to_prune):
"""
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
class PreTrainedModel
"""
for layer, heads in heads_to_prune.items():
self.encoder.layer[layer].attention.prune_heads(heads)
@add_start_docstrings_to_model_forward(GLPN_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
@add_code_sample_docstrings(
processor_class=_FEAT_EXTRACTOR_FOR_DOC,
checkpoint=_CHECKPOINT_FOR_DOC,
output_type=BaseModelOutput,
config_class=_CONFIG_FOR_DOC,
modality="vision",
expected_output=_EXPECTED_OUTPUT_SHAPE,
)
# Copied from transformers.models.segformer.modeling_segformer.SegformerModel.forward
def forward(
self,
pixel_values: torch.FloatTensor,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutput]:
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
encoder_outputs = self.encoder(
pixel_values,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = encoder_outputs[0]
if not return_dict:
return (sequence_output,) + encoder_outputs[1:]
return BaseModelOutput(
last_hidden_state=sequence_output,
hidden_states=encoder_outputs.hidden_states,
attentions=encoder_outputs.attentions,
)
class GLPNSelectiveFeatureFusion(nn.Module):
"""
Selective Feature Fusion module, as explained in the [paper](https://arxiv.org/abs/2201.07436) (section 3.4). This
module adaptively selects and integrates local and global features by attaining an attention map for each feature.
"""
def __init__(self, in_channel=64):
super().__init__()
self.convolutional_layer1 = nn.Sequential(
nn.Conv2d(in_channels=int(in_channel * 2), out_channels=in_channel, kernel_size=3, stride=1, padding=1),
nn.BatchNorm2d(in_channel),
nn.ReLU(),
)
self.convolutional_layer2 = nn.Sequential(
nn.Conv2d(in_channels=in_channel, out_channels=int(in_channel / 2), kernel_size=3, stride=1, padding=1),
nn.BatchNorm2d(int(in_channel / 2)),
nn.ReLU(),
)
self.convolutional_layer3 = nn.Conv2d(
in_channels=int(in_channel / 2), out_channels=2, kernel_size=3, stride=1, padding=1
)
self.sigmoid = nn.Sigmoid()
def forward(self, local_features, global_features):
# concatenate features along the channel dimension
features = torch.cat((local_features, global_features), dim=1)
# pass through convolutional layers
features = self.convolutional_layer1(features)
features = self.convolutional_layer2(features)
features = self.convolutional_layer3(features)
# apply sigmoid to get two-channel attention map
attn = self.sigmoid(features)
# construct hybrid features by adding element-wise
hybrid_features = local_features * attn[:, 0, :, :].unsqueeze(1) + global_features * attn[
:, 1, :, :
].unsqueeze(1)
return hybrid_features
class GLPNDecoderStage(nn.Module):
def __init__(self, in_channels, out_channels):
super().__init__()
should_skip = in_channels == out_channels
self.convolution = nn.Conv2d(in_channels, out_channels, kernel_size=1) if not should_skip else nn.Identity()
self.fusion = GLPNSelectiveFeatureFusion(out_channels)
self.upsample = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
def forward(self, hidden_state, residual=None):
hidden_state = self.convolution(hidden_state)
if residual is not None:
hidden_state = self.fusion(hidden_state, residual)
hidden_state = self.upsample(hidden_state)
return hidden_state
hidden_state = self.upsample(hidden_state)
return hidden_state
class GLPNDecoder(nn.Module):
def __init__(self, config):
super().__init__()
# we use features from end -> start
reserved_hidden_sizes = config.hidden_sizes[::-1]
out_channels = config.decoder_hidden_size
self.stages = nn.ModuleList(
[GLPNDecoderStage(hidden_size, out_channels) for hidden_size in reserved_hidden_sizes]
)
# don't fuse in first stage
self.stages[0].fusion = None
self.final_upsample = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
def forward(self, hidden_states: List[torch.Tensor]) -> List[torch.Tensor]:
stage_hidden_states = []
stage_hidden_state = None
for hidden_state, stage in zip(hidden_states[::-1], self.stages):
stage_hidden_state = stage(hidden_state, stage_hidden_state)
stage_hidden_states.append(stage_hidden_state)
stage_hidden_states[-1] = self.final_upsample(stage_hidden_state)
return stage_hidden_states
class SiLogLoss(nn.Module):
"""
Implements the Scale-invariant log scale loss [Eigen et al., 2014](https://arxiv.org/abs/1406.2283).
$$L=\frac{1}{n} \sum_{i} d_{i}^{2}-\frac{1}{2 n^{2}}\left(\sum_{i} d_{i}^{2}\right)$$ where $d_{i}=\log y_{i}-\log
y_{i}^{*}$.
"""
def __init__(self, lambd=0.5):
super().__init__()
self.lambd = lambd
def forward(self, pred, target):
valid_mask = (target > 0).detach()
diff_log = torch.log(target[valid_mask]) - torch.log(pred[valid_mask])
loss = torch.sqrt(torch.pow(diff_log, 2).mean() - self.lambd * torch.pow(diff_log.mean(), 2))
return loss
class GLPNDepthEstimationHead(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
channels = config.decoder_hidden_size
self.head = nn.Sequential(
nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=False),
nn.Conv2d(channels, 1, kernel_size=3, stride=1, padding=1),
)
def forward(self, hidden_states: List[torch.Tensor]) -> torch.Tensor:
# use last features of the decoder
hidden_states = hidden_states[self.config.head_in_index]
hidden_states = self.head(hidden_states)
predicted_depth = torch.sigmoid(hidden_states) * self.config.max_depth
predicted_depth = predicted_depth.squeeze(dim=1)
return predicted_depth
@add_start_docstrings(
"""GLPN Model transformer with a lightweight depth estimation head on top e.g. for KITTI, NYUv2.""",
GLPN_START_DOCSTRING,
)
class GLPNForDepthEstimation(GLPNPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.glpn = GLPNModel(config)
self.decoder = GLPNDecoder(config)
self.head = GLPNDepthEstimationHead(config)
# Initialize weights and apply final processing
self.post_init()
@add_start_docstrings_to_model_forward(GLPN_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@replace_return_docstrings(output_type=DepthEstimatorOutput, config_class=_CONFIG_FOR_DOC)
def forward(
self,
pixel_values,
labels=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
):
r"""
labels (`torch.FloatTensor` of shape `(batch_size, height, width)`, *optional*):
Ground truth depth estimation maps for computing the loss.
Returns:
Examples:
```python
>>> from transformers import GLPNFeatureExtractor, GLPNForDepthEstimation
>>> from PIL import Image
>>> import requests
>>> feature_extractor = GLPNFeatureExtractor.from_pretrained("vinvino02/glpn-kitti")
>>> model = GLPNForDepthEstimation.from_pretrained("vinvino02/glpn-kitti")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = feature_extractor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> predicted_depth = outputs.predicted_depth # shape (batch_size, height, width)
```"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
outputs = self.glpn(
pixel_values,
output_attentions=output_attentions,
output_hidden_states=True, # we need the intermediate hidden states
return_dict=return_dict,
)
hidden_states = outputs.hidden_states if return_dict else outputs[1]
out = self.decoder(hidden_states)
predicted_depth = self.head(out)
loss = None
if labels is not None:
loss_fct = SiLogLoss()
loss = loss_fct(predicted_depth, labels)
if not return_dict:
if output_hidden_states:
output = (predicted_depth,) + outputs[1:]
else:
output = (predicted_depth,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return DepthEstimatorOutput(
loss=loss,
predicted_depth=predicted_depth,
hidden_states=outputs.hidden_states if output_hidden_states else None,
attentions=outputs.attentions,
)
......@@ -1866,6 +1866,30 @@ def load_tf_weights_in_funnel(*args, **kwargs):
requires_backends(load_tf_weights_in_funnel, ["torch"])
GLPN_PRETRAINED_MODEL_ARCHIVE_LIST = None
class GLPNForDepthEstimation(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class GLPNModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class GLPNPreTrainedModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
GPT2_PRETRAINED_MODEL_ARCHIVE_LIST = None
......
......@@ -52,6 +52,13 @@ class DetrFeatureExtractor(metaclass=DummyObject):
requires_backends(self, ["vision"])
class GLPNFeatureExtractor(metaclass=DummyObject):
_backends = ["vision"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["vision"])
class ImageGPTFeatureExtractor(metaclass=DummyObject):
_backends = ["vision"]
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment