Commit d95d6507, authored by Raushan Turganbay, committed by GitHub

[Bugfix] Fix getting vision features in Transformer Multimodal backend (#32933)


Signed-off-by: raushan <raushan@huggingface.co>
parent 13d8746c
@@ -376,6 +376,15 @@ class MultiModalMixin(SupportsMultiModal, SupportsMRoPE):
     pixel_values, **kwargs
 )
+# Transformers `v5`, `self.get_image_features` returns a tuple
+# containing the features and optionally attentions/hidden_states
+# After v5 is settled, we can enable qwen3-vl with several outputs
+# from `self.get_image_features`
+if isinstance(vision_embeddings, tuple):
+    vision_embeddings = vision_embeddings[0]
+elif isinstance(vision_embeddings, dict):
+    vision_embeddings = vision_embeddings.pooler_output
 if isinstance(vision_embeddings, torch.Tensor):
     if vision_embeddings.ndim == 2:
         vision_embeddings = vision_embeddings.unsqueeze(0)
...
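The unwrapping step in this patch can be illustrated in isolation. The sketch below is not the vLLM implementation: `AttrDict` and `unwrap_vision_features` are hypothetical names, plain lists stand in for tensors, and the key-lookup fallback is an addition the patch does not have. It assumes the dict branch is meant for dict subclasses that also expose their keys as attributes, which is why `.pooler_output` works on something that passes `isinstance(..., dict)`.

```python
class AttrDict(dict):
    """Hypothetical stand-in for a dict-like model output with attribute access."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError as exc:
            raise AttributeError(name) from exc


def unwrap_vision_features(output):
    """Return the feature payload from a tuple, dict-like, or bare output."""
    if isinstance(output, tuple):
        # Tuple outputs carry the features first, optional extras after.
        return output[0]
    if isinstance(output, dict):
        # Attribute access mirrors the patch; the key lookup is an added
        # fallback (an assumption) so a plain dict does not raise
        # AttributeError.
        pooled = getattr(output, "pooler_output", None)
        return pooled if pooled is not None else output["pooler_output"]
    # Anything else (for example a bare tensor) passes through unchanged.
    return output


features = [[0.1, 0.2], [0.3, 0.4]]  # placeholder for a feature tensor

from_tuple = unwrap_vision_features((features, "attentions"))
from_attr_dict = unwrap_vision_features(AttrDict(pooler_output=features))
from_plain_dict = unwrap_vision_features({"pooler_output": features})
from_bare = unwrap_vision_features(features)
```

All four calls return the same `features` object, so the tensor-shape handling that follows in the patched method sees one input type regardless of which form `get_image_features` returned.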