@@ -356,6 +356,44 @@ You can pass a tuple `(array, sampling_rate)` to the `'audio'` field of the mult
...
Full example: [examples/offline_inference/audio_language.py](../../examples/offline_inference/audio_language.py)
#### Automatic Audio Channel Normalization
vLLM automatically normalizes audio channels for models that require a specific channel layout. Libraries such as `torchaudio` load stereo files as arrays of shape `[channels, time]`, but many audio models (particularly Whisper-based models) expect mono audio of shape `[time]`.
**Supported models with automatic mono conversion:**
- **Whisper** and all Whisper-based models
- **Qwen2-Audio**
- **Qwen2.5-Omni** / **Qwen3-Omni** (inherits from Qwen2.5-Omni)
- **Ultravox**
For these models, vLLM automatically:
1. Uses the feature extractor to detect whether the model requires mono audio
2. Converts multi-channel audio to mono using channel averaging
3. Handles both `(channels, time)` format (torchaudio) and `(time, channels)` format (soundfile)
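
To make the conversion concrete, here is a minimal NumPy sketch of channel averaging that accepts either layout. It is an illustration only, not vLLM's actual implementation: the `to_mono` helper is a hypothetical name, and the rule for picking the channel axis (treat the shorter axis as channels) is an assumption made for this example.

```python
import numpy as np


def to_mono(audio: np.ndarray) -> np.ndarray:
    """Average the channels of a 1-D or 2-D audio array into a mono signal.

    Hypothetical helper for illustration; not part of the vLLM API.
    """
    if audio.ndim == 1:
        # Already mono: shape (time,)
        return audio
    if audio.ndim != 2:
        raise ValueError(f"expected 1-D or 2-D audio, got {audio.ndim}-D")

    # Assumption: audio has far more samples than channels, so the shorter
    # axis is the channel axis. Ties are resolved in favor of axis 0.
    channel_axis = 0 if audio.shape[0] <= audio.shape[1] else 1
    return audio.mean(axis=channel_axis)


# One second of synthetic stereo audio at 16 kHz: left is silence, right is 1.0.
stereo_ct = np.stack([np.zeros(16000), np.ones(16000)])  # (channels, time)
stereo_tc = stereo_ct.T                                  # (time, channels)

mono_a = to_mono(stereo_ct)
mono_b = to_mono(stereo_tc)
print(mono_a.shape, mono_b.shape)
```

Both calls return a one-dimensional array with one value per sample, so the same helper works regardless of which library loaded the file.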