Commit 361ae27f (unverified), authored by Harry Mellor, committed by GitHub

[Docs] Fix formatting of transcription doc (#24676)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent e26fef83
......@@ -3,14 +3,18 @@
This document walks you through the steps to add support for speech-to-text (ASR) models to vLLM’s transcription and translation APIs by implementing [SupportsTranscription][vllm.model_executor.models.interfaces.SupportsTranscription].
Please refer to the [supported models](../../models/supported_models.md#transcription) for further guidance.
## 1. Update the base vLLM model
## Update the base vLLM model
It is assumed you have already implemented your model in vLLM according to the basic model guide. Extend your model with the [SupportsTranscription][vllm.model_executor.models.interfaces.SupportsTranscription] interface and implement the following class attributes and methods.
- Declare supported languages and capabilities:
### `supported_languages` and `supports_transcription_only`
??? code
Declare supported languages and capabilities:
- The `supported_languages` mapping is validated at init time.
- Set `supports_transcription_only=True` if the model should not serve text generation (e.g., Whisper).
??? code "supported_languages and supports_transcription_only"
```python
from typing import ClassVar, Mapping, Optional, Literal
import numpy as np
......@@ -34,14 +38,11 @@ It is assumed you have already implemented your model in vLLM according to the b
supports_transcription_only: ClassVar[bool] = True
```
- The `supported_languages` mapping is validated at init time.
- Set `supports_transcription_only=True` if the model should not serve text generation (e.g., Whisper).
- Provide an ASR configuration via [get_speech_to_text_config][vllm.model_executor.models.interfaces.SupportsTranscription.get_speech_to_text_config].
This is for controlling general behavior of the API when serving your model:
Provide an ASR configuration via [get_speech_to_text_config][vllm.model_executor.models.interfaces.SupportsTranscription.get_speech_to_text_config].
??? code
This is for controlling general behavior of the API when serving your model:
??? code "get_speech_to_text_config()"
```python
class YourASRModel(nn.Module, SupportsTranscription):
...
......@@ -61,16 +62,15 @@ It is assumed you have already implemented your model in vLLM according to the b
)
```
See the “Audio preprocessing and chunking” section for what each field controls.
See [Audio preprocessing and chunking](#audio-preprocessing-and-chunking) for what each field controls.
- Implement the prompt construction via [get_generation_prompt][vllm.model_executor.models.interfaces.SupportsTranscription.get_generation_prompt]. The server passes you the resampled waveform and task parameters; you return a valid [PromptType][vllm.inputs.data.PromptType]. There are two common patterns:
Implement the prompt construction via [get_generation_prompt][vllm.model_executor.models.interfaces.SupportsTranscription.get_generation_prompt]. The server passes you the resampled waveform and task parameters; you return a valid [PromptType][vllm.inputs.data.PromptType]. There are two common patterns:
### A. Multimodal LLM with audio embeddings (e.g., Voxtral, Gemma3n)
#### Multimodal LLM with audio embeddings (e.g., Voxtral, Gemma3n)
Return a dict containing `multi_modal_data` with the audio, and either a `prompt` string or `prompt_token_ids`:
??? code
Return a dict containing `multi_modal_data` with the audio, and either a `prompt` string or `prompt_token_ids`:
??? code "get_generation_prompt()"
```python
class YourASRModel(nn.Module, SupportsTranscription):
...
......@@ -102,12 +102,11 @@ It is assumed you have already implemented your model in vLLM according to the b
For more details on multimodal inputs, see [Multi-Modal Inputs](../../features/multimodal_inputs.md).
### B. Encoder–decoder audio-only (e.g., Whisper)
Return a dict with separate `encoder_prompt` and `decoder_prompt` entries:
#### Encoder–decoder audio-only (e.g., Whisper)
??? code
Return a dict with separate `encoder_prompt` and `decoder_prompt` entries:
??? code "get_generation_prompt()"
```python
class YourASRModel(nn.Module, SupportsTranscription):
...
......@@ -142,10 +141,13 @@ It is assumed you have already implemented your model in vLLM according to the b
return cast(PromptType, prompt)
```
- (Optional) Language validation via [validate_language][vllm.model_executor.models.interfaces.SupportsTranscription.validate_language]
### `validate_language` (optional)
If your model requires a language and you want a default, override this method (see Whisper):
Language validation via [validate_language][vllm.model_executor.models.interfaces.SupportsTranscription.validate_language]
If your model requires a language and you want a default, override this method (see Whisper):
??? code "validate_language()"
```python
@classmethod
def validate_language(cls, language: Optional[str]) -> Optional[str]:
......@@ -156,11 +158,13 @@ It is assumed you have already implemented your model in vLLM according to the b
return super().validate_language(language)
```
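As a standalone illustration of the default-language pattern described above, here is a minimal sketch. The class `LanguageValidatorSketch`, its `default_language` attribute, and the two-entry language table are hypothetical and exist only for this example; they are not part of vLLM, and the real base-class behavior may differ.

```python
from typing import ClassVar, Mapping, Optional


class LanguageValidatorSketch:
    """Hypothetical stand-in for a model class; not part of vLLM."""

    # Illustrative language table (ISO 639-1 code -> display name).
    supported_languages: ClassVar[Mapping[str, str]] = {
        "en": "English",
        "it": "Italian",
    }
    # Assumed default, chosen only for this example.
    default_language: ClassVar[str] = "en"

    @classmethod
    def validate_language(cls, language: Optional[str]) -> str:
        # Fall back to the default when the request omits a language.
        if language is None:
            return cls.default_language
        # Reject codes the model does not declare support for.
        if language not in cls.supported_languages:
            raise ValueError(
                f"Unsupported language {language!r}; "
                f"expected one of {sorted(cls.supported_languages)}"
            )
        return language
```

Resolving the default inside the classmethod keeps the request handler free of model-specific language logic.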
- (Optional) Token accounting for streaming via [get_num_audio_tokens][vllm.model_executor.models.interfaces.SupportsTranscription.get_num_audio_tokens]
### `get_num_audio_tokens` (optional)
Token accounting for streaming via [get_num_audio_tokens][vllm.model_executor.models.interfaces.SupportsTranscription.get_num_audio_tokens]
Provide a fast duration→token estimate to improve streaming usage statistics:
Provide a fast duration→token estimate to improve streaming usage statistics:
??? code
??? code "get_num_audio_tokens()"
```python
class YourASRModel(nn.Module, SupportsTranscription):
...
......@@ -176,7 +180,7 @@ It is assumed you have already implemented your model in vLLM according to the b
return int(audio_duration_s * stt_config.sample_rate // 320) # example
```
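To make the arithmetic in the estimate concrete, the sketch below reproduces the same formula as a free function. The name `estimate_audio_tokens`, the 16 kHz default sample rate, and the 320-samples-per-token ratio are assumptions for illustration (the snippet above marks 320 as an example value); use the ratio that matches your model's audio encoder.

```python
def estimate_audio_tokens(
    audio_duration_s: float,
    sample_rate: int = 16_000,      # assumed sample rate for this example
    samples_per_token: int = 320,   # assumed encoder downsampling ratio
) -> int:
    """Estimate how many audio tokens a clip produces, without a forward pass."""
    # Total samples in the clip, floor-divided by the samples consumed per token.
    return int(audio_duration_s * sample_rate // samples_per_token)


# With these assumed values, each second of audio maps to 50 tokens.
thirty_second_clip = estimate_audio_tokens(30.0)
```

Under these assumptions a 30-second clip yields 30 × 16000 // 320 = 1500 tokens.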
## 2. Audio preprocessing and chunking
## Audio preprocessing and chunking
The API server takes care of basic audio I/O and optional chunking before building prompts:
......@@ -185,7 +189,8 @@ The API server takes care of basic audio I/O and optional chunking before buildi
- Energy-aware splitting: When `min_energy_split_window_size` is set, the server finds low-energy regions to minimize cutting within words.
Relevant server logic:
??? code
??? code "_preprocess_speech_to_text()"
```python
# vllm/entrypoints/openai/speech_to_text.py
async def _preprocess_speech_to_text(...):
......@@ -211,9 +216,9 @@ Relevant server logic:
return prompts, duration
```
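The idea behind energy-aware splitting can be sketched independently of the server code: scan a search range near the intended chunk boundary and cut at the window with the lowest mean energy. The function below is a simplified illustration, not the server's implementation; the name `find_quietest_split` and its parameters are hypothetical, and it uses plain Python sequences where real code would typically operate on NumPy arrays.

```python
from typing import Sequence


def find_quietest_split(
    samples: Sequence[float],
    search_start: int,
    search_end: int,
    window_size: int,
) -> int:
    """Return the start index of the lowest-energy window in the search range."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")

    search_end = min(search_end, len(samples))
    last_start = search_end - window_size
    # If the range cannot hold a full window, split at the range start.
    if last_start < search_start:
        return search_start

    best_index = search_start
    best_energy = float("inf")
    # Step through non-overlapping windows and keep the quietest one.
    for start in range(search_start, last_start + 1, window_size):
        window = samples[start:start + window_size]
        energy = sum(x * x for x in window) / window_size
        if energy < best_energy:
            best_energy = energy
            best_index = start
    return best_index
```

Cutting at a quiet window makes it less likely that a word is split across two chunks, which is the motivation for `min_energy_split_window_size`.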
## 3. Exposing tasks automatically
## Exposing tasks automatically
- vLLM automatically advertises transcription support if your model implements the interface:
vLLM automatically advertises transcription support if your model implements the interface:
```python
if supports_transcription(model):
......@@ -222,7 +227,7 @@ if supports_transcription(model):
supported_tasks.append("transcription")
```
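The following self-contained sketch shows one way a flag-based capability check could feed a task list. It is an assumption-laden illustration, not vLLM's actual code: the function `collect_supported_tasks`, the `"generate"` task name, and both dummy classes are hypothetical, and detecting the interface through a `supports_transcription` class attribute is assumed here for simplicity.

```python
from typing import List


def collect_supported_tasks(model_cls: type) -> List[str]:
    """Hypothetical helper: derive the advertised tasks from class-level flags."""
    tasks = ["generate"]  # assumed default task name, for illustration only
    if getattr(model_cls, "supports_transcription", False):
        # A transcription-only model should not advertise text generation.
        if getattr(model_cls, "supports_transcription_only", False):
            return ["transcription"]
        tasks.append("transcription")
    return tasks


class DummyAudioLLM:
    """Stand-in for a multimodal LLM that can also generate text."""
    supports_transcription = True
    supports_transcription_only = False


class DummyWhisperLike:
    """Stand-in for an audio-only encoder-decoder model."""
    supports_transcription = True
    supports_transcription_only = True
```

In this sketch `DummyAudioLLM` advertises both tasks, while `DummyWhisperLike` advertises only transcription, mirroring the role of `supports_transcription_only` described earlier.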
- When enabled, the server initializes the transcription and translation handlers:
When enabled, the server initializes the transcription and translation handlers:
```python
state.openai_serving_transcription = OpenAIServingTranscription(...) if "transcription" in supported_tasks else None
......@@ -231,13 +236,13 @@ state.openai_serving_translation = OpenAIServingTranslation(...) if "transcripti
No extra registration is required beyond having your model class available via the model registry and implementing `SupportsTranscription`.
## 4. Examples in-tree
## Examples in-tree
- Whisper encoder–decoder (audio-only): <gh-file:vllm/model_executor/models/whisper.py>
- Voxtral decoder-only (audio embeddings + LLM): <gh-file:vllm/model_executor/models/voxtral.py>
- Gemma3n decoder-only with fixed instruction prompt: <gh-file:vllm/model_executor/models/gemma3n_mm.py>
## 5. Test with the API
## Test with the API
Once your model implements `SupportsTranscription`, you can test the endpoints (API mimics OpenAI):
......@@ -266,7 +271,6 @@ Once your model implements `SupportsTranscription`, you can test the endpoints (
Or check out more examples in <gh-file:examples/online_serving>.
!!! note
- If your model handles chunking internally (e.g., via its processor or encoder), set `min_energy_split_window_size=None` in the returned `SpeechToTextConfig` to disable server-side chunking.
- Implementing `get_num_audio_tokens` improves accuracy of streaming usage metrics (`prompt_tokens`) without an extra forward pass.
- For multilingual behavior, keep `supported_languages` aligned with actual model capabilities.
- If your model handles chunking internally (e.g., via its processor or encoder), set `min_energy_split_window_size=None` in the returned `SpeechToTextConfig` to disable server-side chunking.
- Implementing `get_num_audio_tokens` improves accuracy of streaming usage metrics (`prompt_tokens`) without an extra forward pass.
- For multilingual behavior, keep `supported_languages` aligned with actual model capabilities.