Unverified Commit da71d444 authored by Cyrus Leung, committed by GitHub

[Doc] Show that `use_audio_in_video` is supported in docs (#30837)


Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
parent 1fb0209b
@@ -767,9 +767,6 @@ Some models are supported only via the [Transformers modeling backend](#transfor
 The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now.
 For more details, please see: <https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630>
-!!! note
-For Qwen2.5-Omni and Qwen3-Omni, reading audio from video pre-processing (`--mm-processor-kwargs '{"use_audio_in_video": true}'`) is currently work in progress and not yet supported.
 #### Transcription
 Speech2Text models trained specifically for Automatic Speech Recognition.
@@ -10,7 +10,6 @@ python examples/offline_inference/qwen2_5_omni/only_thinker.py \
 -q mixed_modalities
 # Read vision and audio inputs from a single video file
-# NOTE: V1 engine does not support interleaved modalities yet.
 python examples/offline_inference/qwen2_5_omni/only_thinker.py \
 -q use_audio_in_video
@@ -1128,8 +1128,6 @@ class Qwen2_5OmniThinkerForConditionalGeneration(
 multimodal_embeddings += tuple(audio_embeddings)
 return multimodal_embeddings
-# TODO (ywang96): support overlapping modality embeddings so that
-# `use_audio_in_video` will work on V1.
 def embed_input_ids(
 self,
 input_ids: torch.Tensor,
@@ -1371,8 +1371,6 @@ class Qwen3OmniMoeThinkerForConditionalGeneration(
 return inputs_embeds
 deepstack_input_embeds = None
-# TODO (ywang96): support overlapping modalitiy embeddings so that
-# `use_audio_in_video` will work on V1.
 # split the feat dim to obtain multi-scale visual feature
 has_vision_embeddings = [
 embeddings.shape[-1] != self.config.text_config.hidden_size
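
For reference, the option this commit now documents as supported is passed as a JSON object to `--mm-processor-kwargs`. The sketch below only assembles such a command line and does not run vLLM. The helper name `build_serve_command` and the model ID `Qwen/Qwen2.5-Omni-7B` are assumptions chosen for illustration and do not appear in the diff above; only the `--mm-processor-kwargs` flag and the `use_audio_in_video` key do.

```python
import json
import shlex


def build_serve_command(model: str, use_audio_in_video: bool = True) -> str:
    """Assemble a `vllm serve` command that sets `use_audio_in_video`.

    Hypothetical helper for illustration; it is not part of vLLM.
    """
    # The processor kwargs are passed as a JSON object on the command line.
    kwargs = json.dumps({"use_audio_in_video": use_audio_in_video})
    args = ["vllm", "serve", model, "--mm-processor-kwargs", kwargs]
    # shlex.join quotes the JSON argument so that it survives the shell.
    return shlex.join(args)


# Assumed model ID; substitute whichever Omni checkpoint is actually in use.
cmd = build_serve_command("Qwen/Qwen2.5-Omni-7B")
print(cmd)
```

Building the JSON with `json.dumps` rather than by hand avoids the common mistake of writing Python's `True` where JSON expects `true`.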