- Uses causal attention for all tokens, including image tokens
- Uses causal attention for all tokens, including image tokens
- Generates reasonable outputs but does not match the original model's attention for text + image inputs, especially when `{"do_pan_and_scan": True}`
- Generates reasonable outputs but does not match the original model's attention for text + image inputs, especially when `{"do_pan_and_scan": true}`
- Will be updated in the future to support the correct behavior
- Will be updated in the future to support the correct behavior
This limitation exists because the model's mixed attention pattern (bidirectional for images, causal otherwise) is not yet supported by vLLM's attention backends.
This limitation exists because the model's mixed attention pattern (bidirectional for images, causal otherwise) is not yet supported by vLLM's attention backends.
...
@@ -1129,7 +1129,7 @@ To use Qwen2.5-Omni, you have to install Hugging Face Transformers library from
...
@@ -1129,7 +1129,7 @@ To use Qwen2.5-Omni, you have to install Hugging Face Transformers library from