sglang / Commits / 7b02c326
"git@developer.sourcefind.cn:norm/vllm.git" did not exist on "794e578de0f9657e169c713908a5fb8d25046b13"
Unverified commit 7b02c326
Authored May 23, 2025 by Chang Su; committed by GitHub on May 23, 2025

[Bugfix](gemma3_mm): handle flatten_batch constraint for multiple images (#6562)

Parent: fefa19fe
Showing 1 changed file with 15 additions and 6 deletions (+15 -6)
python/sglang/srt/models/gemma3_mm.py  (+15 -6)

@@ -288,13 +288,22 @@ class Gemma3ForConditionalGeneration(PreTrainedModel):
                     "MM inputs where only some items are precomputed."
                 )
             return torch.concat([item.precomputed_features for item in items])
-        pixel_values = torch.stack(
-            flatten_nested_list([item.pixel_values for item in items]), dim=0
-        )
-        pixel_values = pixel_values.to(device=self.vision_tower.device)
-        pixel_values = pixel_values.to(dtype=self.language_model.dtype())
-        vision_outputs = self.vision_tower(pixel_values=pixel_values)
+        # Process images one by one to handle flatten_batch=True constraint in vision_tower
+        all_pixel_values = flatten_nested_list([item.pixel_values for item in items])
+        vision_outputs_list = []
+        for pixel_value in all_pixel_values:
+            # Add batch dimension for single image processing
+            pixel_value_batch = pixel_value.unsqueeze(0)
+            pixel_value_batch = pixel_value_batch.to(device=self.vision_tower.device)
+            pixel_value_batch = pixel_value_batch.to(dtype=self.language_model.dtype())
+            vision_output = self.vision_tower(pixel_values=pixel_value_batch)
+            vision_outputs_list.append(vision_output)
+        # Concatenate all vision outputs
+        vision_outputs = torch.cat(vision_outputs_list, dim=0)
         image_features = self.multi_modal_projector(vision_outputs)
         return image_features
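For readers skimming the change: the old code stacked every image's pixel_values into a single batch and called the vision tower once; the new code feeds the images through the vision tower one at a time (each with a batch dimension of 1) and concatenates the resulting features, which sidesteps the flatten_batch constraint when a request carries multiple images. Below is a minimal sketch of that per-image pattern; DummyVisionTower and encode_images_one_by_one are hypothetical stand-ins for illustration, not the actual sglang classes.

# Minimal, self-contained sketch of the per-image pattern shown in the diff above.
# DummyVisionTower and encode_images_one_by_one are hypothetical stand-ins, not
# the real sglang/Gemma3 components.
import torch
import torch.nn as nn


class DummyVisionTower(nn.Module):
    """Toy stand-in: maps one (1, 3, H, W) image to (1, H*W, hidden) patch features."""

    def __init__(self, hidden: int = 16):
        super().__init__()
        self.proj = nn.Linear(3, hidden)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # (1, 3, H, W) -> (1, H*W, 3) -> (1, H*W, hidden)
        patches = pixel_values.flatten(2).transpose(1, 2)
        return self.proj(patches)


def encode_images_one_by_one(tower: nn.Module, images: list) -> torch.Tensor:
    """Run each image through the tower with batch size 1, then concatenate."""
    outputs = []
    for img in images:
        batched = img.unsqueeze(0)                      # add the batch dimension
        batched = batched.to(device=next(tower.parameters()).device)
        outputs.append(tower(pixel_values=batched))     # (1, H*W, hidden)
    return torch.cat(outputs, dim=0)                    # (num_images, H*W, hidden)


if __name__ == "__main__":
    tower = DummyVisionTower()
    images = [torch.randn(3, 8, 8) for _ in range(3)]   # three same-sized images
    features = encode_images_one_by_one(tower, images)
    print(features.shape)                               # torch.Size([3, 64, 16])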