Unverified Commit b752ad30 authored by Eduardo Pacheco, committed by GitHub

Adding grounding dino (#26087)



* Fixed typo when converting weights to GroundingDINO vision backbone

* Final modifications on modeling

* Removed unnecessary class

* Fixed convert structure

* Added image processing

* make fixup partially completed

* Now text_backbone_config has its own class

* Modified convert script

* Removed unnecessary config attribute

* Added new function to generate sub sentence mask

* Renamed parameters with gamma in the name as it's currently not allowed

* Removed tokenization and image_processing scripts since we'll map from existing models

* Fixed some issues with configuration

* Just some modifications on conversion script

* Other modifications

* Copied deformable detr

* First commit

* Added bert to model

* Bert validated

* Created Text and Fusion layers for Encoder

* Adapted Encoder layer

* Fixed typos

* Adjusted Encoder

* Converted encoder to hf

* Modified Decoder Layer

* Modified main decoder class

* Removed copy comments

* Fixed forward from GroundingDINOModel and GroundingDINODecoder

* Added all necessary layers, configurations and forward logic up to GroundingDINOModel

* Added all layers to conversion

* Fixed outputs for GroundingDINOModel and GroundingDINOForObjectDetection

* Fixed mask input to encoders and fixed nn.MultiheadAttention batch first and attn output

* Fixed forward from GroundingDINOTextEnhancerLayer

* Fixed output bug with GroundingDINODeformableLayer

* Fixed bugs that prevent GroundingDINOForObjectDetection to run forward method

* Fixed attentions to be passed correctly

* Passing temperature arg when creating Sine position embedding

* Removed copy comments

* Added temperature argument for position embedding

* Fixed typo when converting weights to GroundingDINO vision backbone

* Final modifications on modeling

* Removed unnecessary class

* Fixed convert structure

* Added image processing

* make fixup partially completed

* Now text_backbone_config has its own class

* Modified convert script

* Removed unnecessary config attribute

* Added new function to generate sub sentence mask

* Renamed parameters with gamma in the name as it's currently not allowed

* Removed tokenization and image_processing scripts since we'll map from existing models

* Fixed some issues with configuration

* Just some modifications on conversion script

* Other modifications

* Fix style

* Improve fixup

* Improve conversion script

* Improve conversion script

* Add GroundingDINOProcessor

* More improvements

* Return token type ids

* something

* Fix more tests

* More improvements

* More cleanup

* More improvements

* Fixed tests, improved modeling and config

* More improvements and fixing tests

* Improved tests and modeling

* Improved tests and added image processor

* Improved tests inference

* More improvements

* More test improvements

* Fixed last test

* Improved docstrings and comments

* Fix style

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py
Co-authored-by: Rafael Padilla <31217453+rafaelpadilla@users.noreply.github.com>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py
Co-authored-by: Rafael Padilla <31217453+rafaelpadilla@users.noreply.github.com>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py
Co-authored-by: Rafael Padilla <31217453+rafaelpadilla@users.noreply.github.com>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py
Co-authored-by: Rafael Padilla <31217453+rafaelpadilla@users.noreply.github.com>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py
Co-authored-by: Rafael Padilla <31217453+rafaelpadilla@users.noreply.github.com>

* Better naming

* Better naming

* Added Copied statement

* Added Copied statement

* Moved param init from GroundingDINOBiMultiHeadAttention

* Better naming

* Fixing clamp style

* Better naming

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update src/transformers/models/grounding_dino/configuration_grounding_dino.py
Co-authored-by: Rafael Padilla <31217453+rafaelpadilla@users.noreply.github.com>

* Update src/transformers/models/grounding_dino/convert_grounding_dino_to_hf.py
Co-authored-by: Rafael Padilla <31217453+rafaelpadilla@users.noreply.github.com>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py
Co-authored-by: Rafael Padilla <31217453+rafaelpadilla@users.noreply.github.com>

* Improving conversion script

* Improved config

* Improved naming

* Improved naming again

* Improved grounding-dino.md

* Moved grounding dino to multimodal

* Update src/transformers/models/grounding_dino/convert_grounding_dino_to_hf.py
Co-authored-by: Rafael Padilla <31217453+rafaelpadilla@users.noreply.github.com>

* Fixed docstrings and style

* Fix docstrings

* Remove timm attributes

* Reorder imports

* More improvements

* Add Grounding DINO to pipeline

* Remove model from check_repo

* Added grounded post_process to GroundingDINOProcessor

* Fixed style

* Fixed GroundingDINOTextPrenetConfig docstrings

* Aligned inputs.keys() when both image and text are passed with model_input_names

* Added tests for GroundingDINOImageProcessor and GroundingDINOProcessor

* Testing post_process_grounded_object_detection from GroundingDINOProcessor at test_inference_object_detection_head

* Fixed order

* Marked test with require_torch

* Temporarily changed repo_id

* More improvements

* Fix style

* Final improvements

* Improve annotators

* Fix style

* Add is_torch_available

* Remove type hints

* vocab_tokens as one liner

* Removed print statements

* Renamed GroundingDINOTextPrenetConfig to GroundingDINOTextConfig

* remove unnecessary comments

* Removed unnecessary tests on conversion script

* Renamed GroundingDINO to camel case GroundingDino

* Fixed GroundingDinoProcessor docstrings

* loading MSDA kernels in the modeling file

* Fix copies

* Replace nn.multiheadattention

* Replace nn.multiheadattention

* Fixed inputs for GroundingDinoMultiheadAttention & order of modules

* Fixed processing to avoid messing with inputs

* Added more tips for GroundingDino

* Make style

* Changing name to align with SAM

* Replace final nn.multiheadattention

* Fix model tests

* Update year, remove GenerationTesterMixin

* Address comments

* Address more comments

* Rename TextPrenet to TextModel

* Rename hidden_states

* Address more comments

* Address more comments

* Address comment

* Address more comments

* Address merge

* Address comment

* Address comment

* Address comment

* Make style

* Added layer norm eps to layer norms

* Address more comments

* More fixes

* Fixed equivalence

* Make fixup

* Remove print statements

* Address comments

* Address comments

* Address comments

* Address comments

* Address comments

* Address comments

* Add comment

* Address comment

* Remove overwriting of test

* Fix bbox_embed

* Improve decoder_bbox_embed_share

* Simplify outputs

* Updated post_process_grounded_object_detection

* Renamed sources to feature_maps

* Improved tests for Grounding Dino ImageProcessor and Processor

* Fixed test requirements and imports

* Fixed image_processing

* Fixed processor tests

* Fixed imports for image processing tests

* Fix copies

* Updated modeling

* Fix style

* Moved functions to correct position

* Fixed copy issues

* Update src/transformers/models/deformable_detr/modeling_deformable_detr.py
Co-authored-by: Sangbum Daniel Choi <34004152+SangbumChoi@users.noreply.github.com>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py
Co-authored-by: Sangbum Daniel Choi <34004152+SangbumChoi@users.noreply.github.com>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py
Co-authored-by: Sangbum Daniel Choi <34004152+SangbumChoi@users.noreply.github.com>

* Keeping consistency custom cuda kernels for MSDA

* Make GroundingDinoProcessor logic clearer

* Updated Grounding DINO checkpoints

* Changed tests to correct structure

* Updated gpu-cpu equivalence test

* fix copies

* Update src/transformers/models/grounding_dino/processing_grounding_dino.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/grounding_dino/processing_grounding_dino.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/grounding_dino/configuration_grounding_dino.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Fixed errors and style

* Fix copies

* Removed inheritance from PreTrainedModel from GroundingDinoTextModel

* Fixed GroundingDinoTextModel

* Fixed type of default backbone config

* Fixed missing methods for GroundingDinoTextModel and Added timm support for GroundingDinoConvEncoder

* Addressed comments

* Addressed batched image processing tests

* Addressed zero shot test comment

* Addressed tip comment

* Removed GroundingDinoTextModel from check_repo

* Removed inplace masking

* Addressed comments

* Addressed comments

* Addressed comments

* Fix copies

* Fixing timm test

* Fixed batching equivalence test

* Update docs/source/en/model_doc/grounding-dino.md
Co-authored-by: Tianqi Xu <40522713+dandansamax@users.noreply.github.com>

* Update docs/source/en/model_doc/grounding-dino.md
Co-authored-by: Tianqi Xu <40522713+dandansamax@users.noreply.github.com>

* Update docs/source/en/model_doc/grounding-dino.md
Co-authored-by: Tianqi Xu <40522713+dandansamax@users.noreply.github.com>

* Addressed more comments

* Added a new comment

* Reduced image size

* Addressed more comments

* Nits

* Nits

* Changed the way text_config is initialized

* Update src/transformers/models/grounding_dino/processing_grounding_dino.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

---------
Co-authored-by: Niels <niels.rogge1@gmail.com>
Co-authored-by: Rafael Padilla <31217453+rafaelpadilla@users.noreply.github.com>
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Eduardo Pacheco <eduardo.pacheco@limehome.com>
Co-authored-by: Sangbum Daniel Choi <34004152+SangbumChoi@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Tianqi Xu <40522713+dandansamax@users.noreply.github.com>
parent a5e5c92a
src/transformers/models/auto/modeling_auto.py
@@ -115,6 +115,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
         ("gptj", "GPTJModel"),
         ("gptsan-japanese", "GPTSanJapaneseForConditionalGeneration"),
         ("graphormer", "GraphormerModel"),
+        ("grounding-dino", "GroundingDinoModel"),
         ("groupvit", "GroupViTModel"),
         ("hubert", "HubertModel"),
         ("ibert", "IBertModel"),
@@ -753,6 +754,7 @@ MODEL_FOR_OBJECT_DETECTION_MAPPING_NAMES = OrderedDict(
 MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING_NAMES = OrderedDict(
     [
         # Model for Zero Shot Object Detection mapping
+        ("grounding-dino", "GroundingDinoForObjectDetection"),
         ("owlv2", "Owlv2ForObjectDetection"),
         ("owlvit", "OwlViTForObjectDetection"),
     ]
...
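With these mapping entries in place, grounding-dino checkpoints resolve through the auto classes. A minimal sketch (the checkpoint name comes from the config archive map added later in this PR):

```python
from transformers import AutoModelForZeroShotObjectDetection

# The "grounding-dino" entry in MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING_NAMES
# routes this checkpoint to GroundingDinoForObjectDetection.
model = AutoModelForZeroShotObjectDetection.from_pretrained("IDEA-Research/grounding-dino-tiny")
print(type(model).__name__)  # GroundingDinoForObjectDetection
```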
src/transformers/models/auto/tokenization_auto.py
@@ -195,6 +195,7 @@ else:
             ("gpt_neox_japanese", ("GPTNeoXJapaneseTokenizer", None)),
             ("gptj", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
             ("gptsan-japanese", ("GPTSanJapaneseTokenizer", None)),
+            ("grounding-dino", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
             ("groupvit", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)),
             ("herbert", ("HerbertTokenizer", "HerbertTokenizerFast" if is_tokenizers_available() else None)),
             ("hubert", ("Wav2Vec2CTCTokenizer", None)),
...
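Similarly, the tokenizer mapping means `AutoTokenizer` hands back a BERT tokenizer for grounding-dino checkpoints; a short sketch:

```python
from transformers import AutoTokenizer

# Resolves via the "grounding-dino" entry added above.
tokenizer = AutoTokenizer.from_pretrained("IDEA-Research/grounding-dino-tiny")
print(type(tokenizer).__name__)  # BertTokenizerFast, if the tokenizers library is installed
```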
src/transformers/models/deformable_detr/modeling_deformable_detr.py
@@ -710,13 +710,14 @@ class DeformableDetrMultiscaleDeformableAttention(nn.Module):
             batch_size, num_queries, self.n_heads, self.n_levels, self.n_points
         )
         # batch_size, num_queries, n_heads, n_levels, n_points, 2
-        if reference_points.shape[-1] == 2:
+        num_coordinates = reference_points.shape[-1]
+        if num_coordinates == 2:
             offset_normalizer = torch.stack([spatial_shapes[..., 1], spatial_shapes[..., 0]], -1)
             sampling_locations = (
                 reference_points[:, :, None, :, None, :]
                 + sampling_offsets / offset_normalizer[None, None, None, :, None, :]
             )
-        elif reference_points.shape[-1] == 4:
+        elif num_coordinates == 4:
             sampling_locations = (
                 reference_points[:, :, None, :, None, :2]
                 + sampling_offsets / self.n_points * reference_points[:, :, None, :, None, 2:] * 0.5
@@ -1401,14 +1402,15 @@ class DeformableDetrDecoder(DeformableDetrPreTrainedModel):
         intermediate_reference_points = ()

         for idx, decoder_layer in enumerate(self.layers):
-            if reference_points.shape[-1] == 4:
+            num_coordinates = reference_points.shape[-1]
+            if num_coordinates == 4:
                 reference_points_input = (
                     reference_points[:, :, None] * torch.cat([valid_ratios, valid_ratios], -1)[:, None]
                 )
-            else:
-                if reference_points.shape[-1] != 2:
-                    raise ValueError("Reference points' last dimension must be of size 2")
+            elif reference_points.shape[-1] == 2:
                 reference_points_input = reference_points[:, :, None] * valid_ratios[:, None]
+            else:
+                raise ValueError("Reference points' last dimension must be of size 2")

             if output_hidden_states:
                 all_hidden_states += (hidden_states,)
@@ -1442,17 +1444,18 @@ class DeformableDetrDecoder(DeformableDetrPreTrainedModel):
             # hack implementation for iterative bounding box refinement
             if self.bbox_embed is not None:
                 tmp = self.bbox_embed[idx](hidden_states)
-                if reference_points.shape[-1] == 4:
+                num_coordinates = reference_points.shape[-1]
+                if num_coordinates == 4:
                     new_reference_points = tmp + inverse_sigmoid(reference_points)
                     new_reference_points = new_reference_points.sigmoid()
-                else:
-                    if reference_points.shape[-1] != 2:
-                        raise ValueError(
-                            f"Reference points' last dimension must be of size 2, but is {reference_points.shape[-1]}"
-                        )
+                elif num_coordinates == 2:
                     new_reference_points = tmp
                     new_reference_points[..., :2] = tmp[..., :2] + inverse_sigmoid(reference_points)
                     new_reference_points = new_reference_points.sigmoid()
+                else:
+                    raise ValueError(
+                        f"Last dim of reference_points must be 2 or 4, but got {reference_points.shape[-1]}"
+                    )

                 reference_points = new_reference_points.detach()
                 intermediate += (hidden_states,)
...
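The hunks above replace repeated `reference_points.shape[-1]` checks with a `num_coordinates` variable and flatten the nested error handling into an explicit `elif`/`else` chain. A minimal, runnable sketch of the iterative box-refinement step this code implements (the `inverse_sigmoid` helper follows its standard clamped definition; the example tensors are illustrative assumptions):

```python
import torch


def inverse_sigmoid(x, eps=1e-5):
    # Numerically stable logit: maps boxes from [0, 1] back to unbounded space.
    x = x.clamp(min=0, max=1)
    return torch.log(x.clamp(min=eps) / (1 - x).clamp(min=eps))


# Illustrative shapes: batch of 2, 900 queries, 4 box coordinates (cx, cy, w, h).
reference_points = torch.rand(2, 900, 4)
delta = torch.randn(2, 900, 4) * 0.01  # stand-in for self.bbox_embed[idx](hidden_states)

num_coordinates = reference_points.shape[-1]
if num_coordinates == 4:
    # Refine all four coordinates in logit space, then squash back to [0, 1].
    new_reference_points = (delta + inverse_sigmoid(reference_points)).sigmoid()
elif num_coordinates == 2:
    # Only the center (cx, cy) is refined; width/height come straight from delta.
    new_reference_points = delta.clone()
    new_reference_points[..., :2] = delta[..., :2] + inverse_sigmoid(reference_points)
    new_reference_points = new_reference_points.sigmoid()
else:
    raise ValueError(f"Last dim of reference_points must be 2 or 4, but got {num_coordinates}")

reference_points = new_reference_points.detach()  # detached, as in the diff, to stop gradients
print(reference_points.shape)  # torch.Size([2, 900, 4])
```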
src/transformers/models/deta/modeling_deta.py
@@ -682,13 +682,14 @@ class DetaMultiscaleDeformableAttention(nn.Module):
             batch_size, num_queries, self.n_heads, self.n_levels, self.n_points
         )
         # batch_size, num_queries, n_heads, n_levels, n_points, 2
-        if reference_points.shape[-1] == 2:
+        num_coordinates = reference_points.shape[-1]
+        if num_coordinates == 2:
             offset_normalizer = torch.stack([spatial_shapes[..., 1], spatial_shapes[..., 0]], -1)
             sampling_locations = (
                 reference_points[:, :, None, :, None, :]
                 + sampling_offsets / offset_normalizer[None, None, None, :, None, :]
             )
-        elif reference_points.shape[-1] == 4:
+        elif num_coordinates == 4:
             sampling_locations = (
                 reference_points[:, :, None, :, None, :2]
                 + sampling_offsets / self.n_points * reference_points[:, :, None, :, None, 2:] * 0.5
...
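Both attention classes share the same sampling-location arithmetic. A self-contained sketch of that normalization (tensor sizes are illustrative assumptions):

```python
import torch

# Illustrative sizes: batch 2, 100 queries, 8 heads, 4 feature levels, 4 sampled points.
batch_size, num_queries, n_heads, n_levels, n_points = 2, 100, 8, 4, 4
sampling_offsets = torch.randn(batch_size, num_queries, n_heads, n_levels, n_points, 2)
spatial_shapes = torch.tensor([[64, 64], [32, 32], [16, 16], [8, 8]])  # (height, width) per level
reference_points = torch.rand(batch_size, num_queries, n_levels, 2)

num_coordinates = reference_points.shape[-1]
if num_coordinates == 2:
    # Offsets are predicted in pixels, so divide by (width, height) to stay in [0, 1].
    offset_normalizer = torch.stack([spatial_shapes[..., 1], spatial_shapes[..., 0]], -1)
    sampling_locations = (
        reference_points[:, :, None, :, None, :]
        + sampling_offsets / offset_normalizer[None, None, None, :, None, :]
    )
elif num_coordinates == 4:
    # For (cx, cy, w, h) references, offsets are scaled by half the box size instead.
    sampling_locations = (
        reference_points[:, :, None, :, None, :2]
        + sampling_offsets / n_points * reference_points[:, :, None, :, None, 2:] * 0.5
    )

print(sampling_locations.shape)  # torch.Size([2, 100, 8, 4, 4, 2])
```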
src/transformers/models/grounding_dino/__init__.py

# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available


_import_structure = {
    "configuration_grounding_dino": [
        "GROUNDING_DINO_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "GroundingDinoConfig",
    ],
    "processing_grounding_dino": ["GroundingDinoProcessor"],
}

try:
    if not is_torch_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["modeling_grounding_dino"] = [
        "GROUNDING_DINO_PRETRAINED_MODEL_ARCHIVE_LIST",
        "GroundingDinoForObjectDetection",
        "GroundingDinoModel",
        "GroundingDinoPreTrainedModel",
    ]

try:
    if not is_vision_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["image_processing_grounding_dino"] = ["GroundingDinoImageProcessor"]


if TYPE_CHECKING:
    from .configuration_grounding_dino import (
        GROUNDING_DINO_PRETRAINED_CONFIG_ARCHIVE_MAP,
        GroundingDinoConfig,
    )
    from .processing_grounding_dino import GroundingDinoProcessor

    try:
        if not is_torch_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .modeling_grounding_dino import (
            GROUNDING_DINO_PRETRAINED_MODEL_ARCHIVE_LIST,
            GroundingDinoForObjectDetection,
            GroundingDinoModel,
            GroundingDinoPreTrainedModel,
        )

    try:
        if not is_vision_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .image_processing_grounding_dino import GroundingDinoImageProcessor

else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
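The lazy structure above keeps top-level imports cheap: optional backends are only required on first access. A small sketch of guarding imports the same way downstream (names as exported above):

```python
from transformers.utils import is_torch_available, is_vision_available

from transformers import GroundingDinoConfig, GroundingDinoProcessor  # always importable

if is_torch_available():
    from transformers import GroundingDinoForObjectDetection, GroundingDinoModel

if is_vision_available():
    from transformers import GroundingDinoImageProcessor
```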
src/transformers/models/grounding_dino/configuration_grounding_dino.py

# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Grounding DINO model configuration"""

from ...configuration_utils import PretrainedConfig
from ...utils import logging
from ..auto import CONFIG_MAPPING


logger = logging.get_logger(__name__)

GROUNDING_DINO_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "IDEA-Research/grounding-dino-tiny": "https://huggingface.co/IDEA-Research/grounding-dino-tiny/resolve/main/config.json",
}
class GroundingDinoConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`GroundingDinoModel`]. It is used to instantiate
    a Grounding DINO model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the Grounding DINO
    [IDEA-Research/grounding-dino-tiny](https://huggingface.co/IDEA-Research/grounding-dino-tiny) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        backbone_config (`PretrainedConfig` or `dict`, *optional*, defaults to a tiny `SwinConfig()`):
            The configuration of the backbone model.
        backbone (`str`, *optional*):
            Name of backbone to use when `backbone_config` is `None`. If `use_pretrained_backbone` is `True`, this
            will load the corresponding pretrained weights from the timm or transformers library. If
            `use_pretrained_backbone` is `False`, this loads the backbone's config and uses that to initialize the
            backbone with random weights.
        use_pretrained_backbone (`bool`, *optional*, defaults to `False`):
            Whether to use pretrained weights for the backbone.
        use_timm_backbone (`bool`, *optional*, defaults to `False`):
            Whether to load `backbone` from the timm library. If `False`, the backbone is loaded from the transformers
            library.
        backbone_kwargs (`dict`, *optional*):
            Keyword arguments to be passed to AutoBackbone when loading from a checkpoint,
            e.g. `{'out_indices': (0, 1, 2, 3)}`. Cannot be specified if `backbone_config` is set.
        text_config (`Union[AutoConfig, dict]`, *optional*, defaults to `BertConfig`):
            The config object or dictionary of the text backbone.
        num_queries (`int`, *optional*, defaults to 900):
            Number of object queries, i.e. detection slots. This is the maximal number of objects
            [`GroundingDinoModel`] can detect in a single image.
        encoder_layers (`int`, *optional*, defaults to 6):
            Number of encoder layers.
        encoder_ffn_dim (`int`, *optional*, defaults to 2048):
            Dimension of the "intermediate" (often named feed-forward) layer in the encoder.
        encoder_attention_heads (`int`, *optional*, defaults to 8):
            Number of attention heads for each attention layer in the Transformer encoder.
        decoder_layers (`int`, *optional*, defaults to 6):
            Number of decoder layers.
        decoder_ffn_dim (`int`, *optional*, defaults to 2048):
            Dimension of the "intermediate" (often named feed-forward) layer in the decoder.
        decoder_attention_heads (`int`, *optional*, defaults to 8):
            Number of attention heads for each attention layer in the Transformer decoder.
        is_encoder_decoder (`bool`, *optional*, defaults to `True`):
            Whether the model is used as an encoder/decoder or not.
        activation_function (`str` or `function`, *optional*, defaults to `"relu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        d_model (`int`, *optional*, defaults to 256):
            Dimension of the layers.
        dropout (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        activation_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for activations inside the fully connected layer.
        auxiliary_loss (`bool`, *optional*, defaults to `False`):
            Whether auxiliary decoding losses (loss at each decoder layer) are to be used.
        position_embedding_type (`str`, *optional*, defaults to `"sine"`):
            Type of position embeddings to be used on top of the image features. One of `"sine"` or `"learned"`.
        num_feature_levels (`int`, *optional*, defaults to 4):
            The number of input feature levels.
        encoder_n_points (`int`, *optional*, defaults to 4):
            The number of sampled keys in each feature level for each attention head in the encoder.
        decoder_n_points (`int`, *optional*, defaults to 4):
            The number of sampled keys in each feature level for each attention head in the decoder.
        two_stage (`bool`, *optional*, defaults to `True`):
            Whether to apply a two-stage deformable DETR, where the region proposals are also generated by a variant
            of Grounding DINO, which are further fed into the decoder for iterative bounding box refinement.
        class_cost (`float`, *optional*, defaults to 1.0):
            Relative weight of the classification error in the Hungarian matching cost.
        bbox_cost (`float`, *optional*, defaults to 5.0):
            Relative weight of the L1 error of the bounding box coordinates in the Hungarian matching cost.
        giou_cost (`float`, *optional*, defaults to 2.0):
            Relative weight of the generalized IoU loss of the bounding box in the Hungarian matching cost.
        bbox_loss_coefficient (`float`, *optional*, defaults to 5.0):
            Relative weight of the L1 bounding box loss in the object detection loss.
        giou_loss_coefficient (`float`, *optional*, defaults to 2.0):
            Relative weight of the generalized IoU loss in the object detection loss.
        focal_alpha (`float`, *optional*, defaults to 0.25):
            Alpha parameter in the focal loss.
        disable_custom_kernels (`bool`, *optional*, defaults to `False`):
            Disable the use of custom CUDA and CPU kernels. This option is necessary for the ONNX export, as custom
            kernels are not supported by PyTorch ONNX export.
        max_text_len (`int`, *optional*, defaults to 256):
            The maximum length of the text input.
        text_enhancer_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the text enhancer.
        fusion_droppath (`float`, *optional*, defaults to 0.1):
            The droppath ratio for the fusion module.
        fusion_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the fusion module.
        embedding_init_target (`bool`, *optional*, defaults to `True`):
            Whether to initialize the target with Embedding weights.
        query_dim (`int`, *optional*, defaults to 4):
            The dimension of the query vector.
        decoder_bbox_embed_share (`bool`, *optional*, defaults to `True`):
            Whether to share the bbox regression head for all decoder layers.
        two_stage_bbox_embed_share (`bool`, *optional*, defaults to `False`):
            Whether to share the bbox embedding between the two-stage bbox generator and the region proposal
            generation.
        positional_embedding_temperature (`float`, *optional*, defaults to 20):
            The temperature for the Sine position embedding used together with the vision backbone.
        init_std (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon used by the layer normalization layers.

    Examples:

    ```python
    >>> from transformers import GroundingDinoConfig, GroundingDinoModel

    >>> # Initializing a Grounding DINO IDEA-Research/grounding-dino-tiny style configuration
    >>> configuration = GroundingDinoConfig()

    >>> # Initializing a model (with random weights) from the IDEA-Research/grounding-dino-tiny style configuration
    >>> model = GroundingDinoModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""
model_type = "grounding-dino"
attribute_map = {
"hidden_size": "d_model",
"num_attention_heads": "encoder_attention_heads",
}
def __init__(
self,
backbone_config=None,
backbone=None,
use_pretrained_backbone=False,
use_timm_backbone=False,
backbone_kwargs=None,
text_config=None,
num_queries=900,
encoder_layers=6,
encoder_ffn_dim=2048,
encoder_attention_heads=8,
decoder_layers=6,
decoder_ffn_dim=2048,
decoder_attention_heads=8,
is_encoder_decoder=True,
activation_function="relu",
d_model=256,
dropout=0.1,
attention_dropout=0.0,
activation_dropout=0.0,
auxiliary_loss=False,
position_embedding_type="sine",
num_feature_levels=4,
encoder_n_points=4,
decoder_n_points=4,
two_stage=True,
class_cost=1.0,
bbox_cost=5.0,
giou_cost=2.0,
bbox_loss_coefficient=5.0,
giou_loss_coefficient=2.0,
focal_alpha=0.25,
disable_custom_kernels=False,
# other parameters
max_text_len=256,
text_enhancer_dropout=0.0,
fusion_droppath=0.1,
fusion_dropout=0.0,
embedding_init_target=True,
query_dim=4,
decoder_bbox_embed_share=True,
two_stage_bbox_embed_share=False,
positional_embedding_temperature=20,
init_std=0.02,
layer_norm_eps=1e-5,
**kwargs,
):
if not use_timm_backbone and use_pretrained_backbone:
raise ValueError(
"Loading pretrained backbone weights from the transformers library is not supported yet. `use_timm_backbone` must be set to `True` when `use_pretrained_backbone=True`"
)
if backbone_config is not None and backbone is not None:
raise ValueError("You can't specify both `backbone` and `backbone_config`.")
if backbone_config is None and backbone is None:
logger.info("`backbone_config` is `None`. Initializing the config with the default `Swin` backbone.")
backbone_config = CONFIG_MAPPING["swin"](
window_size=7,
image_size=224,
embed_dim=96,
depths=[2, 2, 6, 2],
num_heads=[3, 6, 12, 24],
out_indices=[2, 3, 4],
)
elif isinstance(backbone_config, dict):
backbone_model_type = backbone_config.pop("model_type")
config_class = CONFIG_MAPPING[backbone_model_type]
backbone_config = config_class.from_dict(backbone_config)
if backbone_kwargs is not None and backbone_kwargs and backbone_config is not None:
raise ValueError("You can't specify both `backbone_kwargs` and `backbone_config`.")
if text_config is None:
text_config = {}
logger.info("text_config is None. Initializing the text config with default values (`BertConfig`).")
self.backbone_config = backbone_config
self.backbone = backbone
self.use_pretrained_backbone = use_pretrained_backbone
self.use_timm_backbone = use_timm_backbone
self.backbone_kwargs = backbone_kwargs
self.num_queries = num_queries
self.d_model = d_model
self.encoder_ffn_dim = encoder_ffn_dim
self.encoder_layers = encoder_layers
self.encoder_attention_heads = encoder_attention_heads
self.decoder_ffn_dim = decoder_ffn_dim
self.decoder_layers = decoder_layers
self.decoder_attention_heads = decoder_attention_heads
self.dropout = dropout
self.attention_dropout = attention_dropout
self.activation_dropout = activation_dropout
self.activation_function = activation_function
self.auxiliary_loss = auxiliary_loss
self.position_embedding_type = position_embedding_type
# deformable attributes
self.num_feature_levels = num_feature_levels
self.encoder_n_points = encoder_n_points
self.decoder_n_points = decoder_n_points
self.two_stage = two_stage
# Hungarian matcher
self.class_cost = class_cost
self.bbox_cost = bbox_cost
self.giou_cost = giou_cost
# Loss coefficients
self.bbox_loss_coefficient = bbox_loss_coefficient
self.giou_loss_coefficient = giou_loss_coefficient
self.focal_alpha = focal_alpha
self.disable_custom_kernels = disable_custom_kernels
# Text backbone
if isinstance(text_config, dict):
text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "bert"
text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
elif text_config is None:
text_config = CONFIG_MAPPING["bert"]()
self.text_config = text_config
self.max_text_len = max_text_len
# Text Enhancer
self.text_enhancer_dropout = text_enhancer_dropout
# Fusion
self.fusion_droppath = fusion_droppath
self.fusion_dropout = fusion_dropout
# Others
self.embedding_init_target = embedding_init_target
self.query_dim = query_dim
self.decoder_bbox_embed_share = decoder_bbox_embed_share
self.two_stage_bbox_embed_share = two_stage_bbox_embed_share
if two_stage_bbox_embed_share and not decoder_bbox_embed_share:
raise ValueError("If two_stage_bbox_embed_share is True, decoder_bbox_embed_share must be True.")
self.positional_embedding_temperature = positional_embedding_temperature
self.init_std = init_std
self.layer_norm_eps = layer_norm_eps
super().__init__(is_encoder_decoder=is_encoder_decoder, **kwargs)
@property
def num_attention_heads(self) -> int:
return self.encoder_attention_heads
@property
def hidden_size(self) -> int:
return self.d_model
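Since `text_config` may be passed as a plain dict (converted via `CONFIG_MAPPING`, with `model_type` defaulting to `"bert"`), a short sketch of customizing it (the hidden_size/num_hidden_layers values are illustrative assumptions):

```python
from transformers import GroundingDinoConfig

# A dict without "model_type" is treated as a BertConfig (see __init__ above).
config = GroundingDinoConfig(text_config={"hidden_size": 768, "num_hidden_layers": 6})
print(type(config.text_config).__name__)  # BertConfig
print(config.backbone_config.model_type)  # "swin" (the default vision backbone)
```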
src/transformers/models/grounding_dino/processing_grounding_dino.py

# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Processor class for Grounding DINO.
"""

from typing import List, Optional, Tuple, Union

from ...image_processing_utils import BatchFeature
from ...image_transforms import center_to_corners_format
from ...image_utils import ImageInput
from ...processing_utils import ProcessorMixin
from ...tokenization_utils_base import BatchEncoding, PaddingStrategy, PreTokenizedInput, TextInput, TruncationStrategy
from ...utils import TensorType, is_torch_available


if is_torch_available():
    import torch
def get_phrases_from_posmap(posmaps, input_ids):
    """Get token ids of phrases from posmaps and input_ids.

    Args:
        posmaps (`torch.BoolTensor` of shape `(num_boxes, hidden_size)`):
            A boolean tensor of text-thresholded logits related to the detected bounding boxes.
        input_ids (`torch.LongTensor` of shape `(sequence_length,)`):
            A tensor of token ids.
    """
    left_idx = 0
    right_idx = posmaps.shape[-1] - 1

    # Avoiding altering the input tensor
    posmaps = posmaps.clone()

    posmaps[:, 0 : left_idx + 1] = False
    posmaps[:, right_idx:] = False

    token_ids = []
    for posmap in posmaps:
        non_zero_idx = posmap.nonzero(as_tuple=True)[0].tolist()
        token_ids.append([input_ids[i] for i in non_zero_idx])

    return token_ids
class GroundingDinoProcessor(ProcessorMixin):
    r"""
    Constructs a Grounding DINO processor which wraps a Deformable DETR image processor and a BERT tokenizer into a
    single processor.

    [`GroundingDinoProcessor`] offers all the functionalities of [`GroundingDinoImageProcessor`] and
    [`AutoTokenizer`]. See the docstring of [`~GroundingDinoProcessor.__call__`] and [`~GroundingDinoProcessor.decode`]
    for more information.

    Args:
        image_processor (`GroundingDinoImageProcessor`):
            An instance of [`GroundingDinoImageProcessor`]. The image processor is a required input.
        tokenizer (`AutoTokenizer`):
            An instance of [`PreTrainedTokenizer`]. The tokenizer is a required input.
    """

    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "GroundingDinoImageProcessor"
    tokenizer_class = "AutoTokenizer"

    def __init__(self, image_processor, tokenizer):
        super().__init__(image_processor, tokenizer)

    def __call__(
        self,
        images: ImageInput = None,
        text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
        add_special_tokens: bool = True,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = None,
        max_length: Optional[int] = None,
        stride: int = 0,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
        return_overflowing_tokens: bool = False,
        return_special_tokens_mask: bool = False,
        return_offsets_mapping: bool = False,
        return_token_type_ids: bool = True,
        return_length: bool = False,
        verbose: bool = True,
        return_tensors: Optional[Union[str, TensorType]] = None,
        **kwargs,
    ) -> BatchEncoding:
        """
        This method uses [`GroundingDinoImageProcessor.__call__`] method to prepare image(s) for the model, and
        [`BertTokenizerFast.__call__`] to prepare text for the model.

        Please refer to the docstring of the above two methods for more information.
        """
        if images is None and text is None:
            raise ValueError("You have to specify either images or text.")

        # Process images if provided
        if images is not None:
            encoding_image_processor = self.image_processor(images, return_tensors=return_tensors)
        else:
            encoding_image_processor = BatchFeature()

        if text is not None:
            text_encoding = self.tokenizer(
                text=text,
                add_special_tokens=add_special_tokens,
                padding=padding,
                truncation=truncation,
                max_length=max_length,
                stride=stride,
                pad_to_multiple_of=pad_to_multiple_of,
                return_attention_mask=return_attention_mask,
                return_overflowing_tokens=return_overflowing_tokens,
                return_special_tokens_mask=return_special_tokens_mask,
                return_offsets_mapping=return_offsets_mapping,
                return_token_type_ids=return_token_type_ids,
                return_length=return_length,
                verbose=verbose,
                return_tensors=return_tensors,
                **kwargs,
            )
        else:
            text_encoding = BatchEncoding()

        text_encoding.update(encoding_image_processor)

        return text_encoding
    # Copied from transformers.models.blip.processing_blip.BlipProcessor.batch_decode with BertTokenizerFast->PreTrainedTokenizer
    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to PreTrainedTokenizer's [`~PreTrainedTokenizer.batch_decode`]. Please
        refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    # Copied from transformers.models.blip.processing_blip.BlipProcessor.decode with BertTokenizerFast->PreTrainedTokenizer
    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to PreTrainedTokenizer's [`~PreTrainedTokenizer.decode`]. Please refer
        to the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    @property
    # Copied from transformers.models.blip.processing_blip.BlipProcessor.model_input_names
    def model_input_names(self):
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
    def post_process_grounded_object_detection(
        self,
        outputs,
        input_ids,
        box_threshold: float = 0.25,
        text_threshold: float = 0.25,
        target_sizes: Union[TensorType, List[Tuple]] = None,
    ):
        """
        Converts the raw output of [`GroundingDinoForObjectDetection`] into final bounding boxes in (top_left_x,
        top_left_y, bottom_right_x, bottom_right_y) format and retrieves the associated text labels.

        Args:
            outputs ([`GroundingDinoObjectDetectionOutput`]):
                Raw outputs of the model.
            input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
                The token ids of the input text.
            box_threshold (`float`, *optional*, defaults to 0.25):
                Score threshold to keep object detection predictions.
            text_threshold (`float`, *optional*, defaults to 0.25):
                Score threshold to keep text detection predictions.
            target_sizes (`torch.Tensor` or `List[Tuple[int, int]]`, *optional*):
                Tensor of shape `(batch_size, 2)` or list of tuples (`Tuple[int, int]`) containing the target size
                `(height, width)` of each image in the batch. If unset, predictions will not be resized.

        Returns:
            `List[Dict]`: A list of dictionaries, each dictionary containing the scores, labels and boxes for an image
            in the batch as predicted by the model.
        """
        logits, boxes = outputs.logits, outputs.pred_boxes

        if target_sizes is not None:
            if len(logits) != len(target_sizes):
                raise ValueError(
                    "Make sure that you pass in as many target sizes as the batch dimension of the logits"
                )

        probs = torch.sigmoid(logits)  # (batch_size, num_queries, 256)
        scores = torch.max(probs, dim=-1)[0]  # (batch_size, num_queries)

        # Convert to [x0, y0, x1, y1] format
        boxes = center_to_corners_format(boxes)

        # Convert from relative [0, 1] to absolute [0, height] coordinates
        if target_sizes is not None:
            if isinstance(target_sizes, List):
                img_h = torch.Tensor([i[0] for i in target_sizes])
                img_w = torch.Tensor([i[1] for i in target_sizes])
            else:
                img_h, img_w = target_sizes.unbind(1)

            scale_fct = torch.stack([img_w, img_h, img_w, img_h], dim=1).to(boxes.device)
            boxes = boxes * scale_fct[:, None, :]

        results = []
        for idx, (s, b, p) in enumerate(zip(scores, boxes, probs)):
            score = s[s > box_threshold]
            box = b[s > box_threshold]
            prob = p[s > box_threshold]
            label_ids = get_phrases_from_posmap(prob > text_threshold, input_ids[idx])
            label = self.batch_decode(label_ids)
            results.append({"scores": score, "labels": label, "boxes": box})

        return results
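Taken together, a hedged end-to-end sketch of the processor plus the grounded post-processing above (the checkpoint name comes from this PR's archive map; the image URL and thresholds are illustrative assumptions):

```python
import requests
import torch
from PIL import Image

from transformers import AutoProcessor, GroundingDinoForObjectDetection

processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
model = GroundingDinoForObjectDetection.from_pretrained("IDEA-Research/grounding-dino-tiny")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # illustrative image
image = Image.open(requests.get(url, stream=True).raw)
text = "a cat."  # text queries are lowercased and end with a dot

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],  # PIL size is (width, height), so reverse to (height, width)
)
print(results[0]["scores"], results[0]["labels"], results[0]["boxes"])
```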
src/transformers/utils/dummy_pt_objects.py
@@ -4236,6 +4236,30 @@ class GraphormerPreTrainedModel(metaclass=DummyObject):
         requires_backends(self, ["torch"])


+GROUNDING_DINO_PRETRAINED_MODEL_ARCHIVE_LIST = None
+
+
+class GroundingDinoForObjectDetection(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
+class GroundingDinoModel(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
+class GroundingDinoPreTrainedModel(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
 GROUPVIT_PRETRAINED_MODEL_ARCHIVE_LIST = None
...
src/transformers/utils/dummy_vision_objects.py
@@ -247,6 +247,13 @@ class GLPNImageProcessor(metaclass=DummyObject):
         requires_backends(self, ["vision"])


+class GroundingDinoImageProcessor(metaclass=DummyObject):
+    _backends = ["vision"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["vision"])
+
+
 class IdeficsImageProcessor(metaclass=DummyObject):
     _backends = ["vision"]
...
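For context, a simplified stand-alone sketch of the dummy-object pattern these additions follow (the real `DummyObject` and `requires_backends` live in `transformers.utils`; this version only mimics the instantiation error):

```python
def requires_backends(obj, backends):
    # Simplified stand-in for transformers.utils.requires_backends.
    name = obj.__name__ if hasattr(obj, "__name__") else type(obj).__name__
    raise ImportError(f"{name} requires the following backends: {', '.join(backends)}")


class DummyObject(type):
    # Metaclass: instantiating a dummy class raises a helpful ImportError
    # instead of an opaque failure when a backend is missing.
    def __call__(cls, *args, **kwargs):
        requires_backends(cls, cls._backends)


class GroundingDinoModel(metaclass=DummyObject):
    _backends = ["torch"]


try:
    GroundingDinoModel()
except ImportError as err:
    print(err)  # GroundingDinoModel requires the following backends: torch
```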
This diff is collapsed.
tests/models/grounding_dino/test_processor_grounding_dino.py

# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import os
import shutil
import tempfile
import unittest

import numpy as np
import pytest

from transformers import BertTokenizer, BertTokenizerFast, GroundingDinoProcessor
from transformers.models.bert.tokenization_bert import VOCAB_FILES_NAMES
from transformers.testing_utils import require_torch, require_vision
from transformers.utils import IMAGE_PROCESSOR_NAME, is_torch_available, is_vision_available


if is_torch_available():
    import torch

    from transformers.models.grounding_dino.modeling_grounding_dino import GroundingDinoObjectDetectionOutput

if is_vision_available():
    from PIL import Image

    from transformers import GroundingDinoImageProcessor
@require_torch
@require_vision
class GroundingDinoProcessorTest(unittest.TestCase):
    def setUp(self):
        self.tmpdirname = tempfile.mkdtemp()

        vocab_tokens = ["[UNK]","[CLS]","[SEP]","[PAD]","[MASK]","want","##want","##ed","wa","un","runn","##ing",",","low","lowest"]  # fmt: skip
        self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"])
        with open(self.vocab_file, "w", encoding="utf-8") as vocab_writer:
            vocab_writer.write("".join([x + "\n" for x in vocab_tokens]))

        image_processor_map = {
            "do_resize": True,
            "size": None,
            "do_normalize": True,
            "image_mean": [0.5, 0.5, 0.5],
            "image_std": [0.5, 0.5, 0.5],
            "do_rescale": True,
            "rescale_factor": 1 / 255,
            "do_pad": True,
        }
        self.image_processor_file = os.path.join(self.tmpdirname, IMAGE_PROCESSOR_NAME)
        with open(self.image_processor_file, "w", encoding="utf-8") as fp:
            json.dump(image_processor_map, fp)

        self.batch_size = 7
        self.num_queries = 5
        self.embed_dim = 5
        self.seq_length = 5
    # Copied from tests.models.clip.test_processor_clip.CLIPProcessorTest.get_tokenizer with CLIP->Bert
    def get_tokenizer(self, **kwargs):
        return BertTokenizer.from_pretrained(self.tmpdirname, **kwargs)

    # Copied from tests.models.clip.test_processor_clip.CLIPProcessorTest.get_rust_tokenizer with CLIP->Bert
    def get_rust_tokenizer(self, **kwargs):
        return BertTokenizerFast.from_pretrained(self.tmpdirname, **kwargs)

    # Copied from tests.models.clip.test_processor_clip.CLIPProcessorTest.get_image_processor with CLIP->GroundingDino
    def get_image_processor(self, **kwargs):
        return GroundingDinoImageProcessor.from_pretrained(self.tmpdirname, **kwargs)

    # Copied from tests.models.clip.test_processor_clip.CLIPProcessorTest.tearDown
    def tearDown(self):
        shutil.rmtree(self.tmpdirname)

    # Copied from tests.models.clip.test_processor_clip.CLIPProcessorTest.prepare_image_inputs
    def prepare_image_inputs(self):
        """This function prepares a list of PIL images, or a list of numpy arrays if one specifies numpify=True,
        or a list of PyTorch tensors if one specifies torchify=True.
        """
        image_inputs = [np.random.randint(255, size=(3, 30, 400), dtype=np.uint8)]
        image_inputs = [Image.fromarray(np.moveaxis(x, 0, -1)) for x in image_inputs]
        return image_inputs

    def get_fake_grounding_dino_output(self):
        torch.manual_seed(42)
        return GroundingDinoObjectDetectionOutput(
            pred_boxes=torch.rand(self.batch_size, self.num_queries, 4),
            logits=torch.rand(self.batch_size, self.num_queries, self.embed_dim),
        )

    def get_fake_grounding_dino_input_ids(self):
        input_ids = torch.tensor([101, 1037, 4937, 1012, 102])
        return torch.stack([input_ids] * self.batch_size, dim=0)
    def test_post_process_grounded_object_detection(self):
        image_processor = self.get_image_processor()
        tokenizer = self.get_tokenizer()

        processor = GroundingDinoProcessor(tokenizer=tokenizer, image_processor=image_processor)

        grounding_dino_output = self.get_fake_grounding_dino_output()
        grounding_dino_input_ids = self.get_fake_grounding_dino_input_ids()

        post_processed = processor.post_process_grounded_object_detection(
            grounding_dino_output, grounding_dino_input_ids
        )

        self.assertEqual(len(post_processed), self.batch_size)
        self.assertEqual(list(post_processed[0].keys()), ["scores", "labels", "boxes"])
        self.assertEqual(post_processed[0]["boxes"].shape, (self.num_queries, 4))
        self.assertEqual(post_processed[0]["scores"].shape, (self.num_queries,))

        expected_scores = torch.tensor([0.7050, 0.7222, 0.7222, 0.6829, 0.7220])
        self.assertTrue(torch.allclose(post_processed[0]["scores"], expected_scores, atol=1e-4))

        expected_box_slice = torch.tensor([0.6908, 0.4354, 1.0737, 1.3947])
        self.assertTrue(torch.allclose(post_processed[0]["boxes"][0], expected_box_slice, atol=1e-4))
    # Copied from tests.models.clip.test_processor_clip.CLIPProcessorTest.test_save_load_pretrained_default with CLIP->GroundingDino,GroundingDinoTokenizer->BertTokenizer
    def test_save_load_pretrained_default(self):
        tokenizer_slow = self.get_tokenizer()
        tokenizer_fast = self.get_rust_tokenizer()
        image_processor = self.get_image_processor()

        processor_slow = GroundingDinoProcessor(tokenizer=tokenizer_slow, image_processor=image_processor)
        processor_slow.save_pretrained(self.tmpdirname)
        processor_slow = GroundingDinoProcessor.from_pretrained(self.tmpdirname, use_fast=False)

        processor_fast = GroundingDinoProcessor(tokenizer=tokenizer_fast, image_processor=image_processor)
        processor_fast.save_pretrained(self.tmpdirname)
        processor_fast = GroundingDinoProcessor.from_pretrained(self.tmpdirname)

        self.assertEqual(processor_slow.tokenizer.get_vocab(), tokenizer_slow.get_vocab())
        self.assertEqual(processor_fast.tokenizer.get_vocab(), tokenizer_fast.get_vocab())
        self.assertEqual(tokenizer_slow.get_vocab(), tokenizer_fast.get_vocab())
        self.assertIsInstance(processor_slow.tokenizer, BertTokenizer)
        self.assertIsInstance(processor_fast.tokenizer, BertTokenizerFast)

        self.assertEqual(processor_slow.image_processor.to_json_string(), image_processor.to_json_string())
        self.assertEqual(processor_fast.image_processor.to_json_string(), image_processor.to_json_string())
        self.assertIsInstance(processor_slow.image_processor, GroundingDinoImageProcessor)
        self.assertIsInstance(processor_fast.image_processor, GroundingDinoImageProcessor)

    # Copied from tests.models.clip.test_processor_clip.CLIPProcessorTest.test_save_load_pretrained_additional_features with CLIP->GroundingDino,GroundingDinoTokenizer->BertTokenizer
    def test_save_load_pretrained_additional_features(self):
        processor = GroundingDinoProcessor(tokenizer=self.get_tokenizer(), image_processor=self.get_image_processor())
        processor.save_pretrained(self.tmpdirname)

        tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)")
        image_processor_add_kwargs = self.get_image_processor(do_normalize=False, padding_value=1.0)

        processor = GroundingDinoProcessor.from_pretrained(
            self.tmpdirname, bos_token="(BOS)", eos_token="(EOS)", do_normalize=False, padding_value=1.0
        )

        self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())
        self.assertIsInstance(processor.tokenizer, BertTokenizerFast)

        self.assertEqual(processor.image_processor.to_json_string(), image_processor_add_kwargs.to_json_string())
        self.assertIsInstance(processor.image_processor, GroundingDinoImageProcessor)

    # Copied from tests.models.clip.test_processor_clip.CLIPProcessorTest.test_image_processor with CLIP->GroundingDino
    def test_image_processor(self):
        image_processor = self.get_image_processor()
        tokenizer = self.get_tokenizer()

        processor = GroundingDinoProcessor(tokenizer=tokenizer, image_processor=image_processor)

        image_input = self.prepare_image_inputs()

        input_image_proc = image_processor(image_input, return_tensors="np")
        input_processor = processor(images=image_input, return_tensors="np")

        for key in input_image_proc.keys():
            self.assertAlmostEqual(input_image_proc[key].sum(), input_processor[key].sum(), delta=1e-2)

    # Copied from tests.models.clip.test_processor_clip.CLIPProcessorTest.test_tokenizer with CLIP->GroundingDino
    def test_tokenizer(self):
        image_processor = self.get_image_processor()
        tokenizer = self.get_tokenizer()

        processor = GroundingDinoProcessor(tokenizer=tokenizer, image_processor=image_processor)

        input_str = "lower newer"

        encoded_processor = processor(text=input_str)
        encoded_tok = tokenizer(input_str)

        for key in encoded_tok.keys():
            self.assertListEqual(encoded_tok[key], encoded_processor[key])

    def test_processor(self):
        image_processor = self.get_image_processor()
        tokenizer = self.get_tokenizer()

        processor = GroundingDinoProcessor(tokenizer=tokenizer, image_processor=image_processor)

        input_str = "lower newer"
        image_input = self.prepare_image_inputs()

        inputs = processor(text=input_str, images=image_input)

        self.assertListEqual(
            list(inputs.keys()), ["input_ids", "token_type_ids", "attention_mask", "pixel_values", "pixel_mask"]
        )

        # test if it raises when no input is passed
        with pytest.raises(ValueError):
            processor()

    # Copied from tests.models.clip.test_processor_clip.CLIPProcessorTest.test_tokenizer_decode with CLIP->GroundingDino
    def test_tokenizer_decode(self):
        image_processor = self.get_image_processor()
        tokenizer = self.get_tokenizer()

        processor = GroundingDinoProcessor(tokenizer=tokenizer, image_processor=image_processor)

        predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9]]

        decoded_processor = processor.batch_decode(predicted_ids)
        decoded_tok = tokenizer.batch_decode(predicted_ids)

        self.assertListEqual(decoded_tok, decoded_processor)

    # Copied from tests.models.clip.test_processor_clip.CLIPProcessorTest.test_model_input_names with CLIP->GroundingDino
    def test_model_input_names(self):
        image_processor = self.get_image_processor()
        tokenizer = self.get_tokenizer()

        processor = GroundingDinoProcessor(tokenizer=tokenizer, image_processor=image_processor)

        input_str = "lower newer"
        image_input = self.prepare_image_inputs()

        inputs = processor(text=input_str, images=image_input)

        self.assertListEqual(list(inputs.keys()), processor.model_input_names)