# Speech-to-Text (Transcription/Translation) Support

This document walks you through the steps to add support for speech-to-text (ASR) models to vLLM’s transcription and translation APIs by implementing [SupportsTranscription][vllm.model_executor.models.interfaces.SupportsTranscription].
Please refer to the [supported models](../../models/supported_models.md#transcription) for further guidance.

## Update the base vLLM model

It is assumed you have already implemented your model in vLLM according to the basic model guide. Extend your model with the [SupportsTranscription][vllm.model_executor.models.interfaces.SupportsTranscription] interface and implement the following class attributes and methods.

### `supported_languages` and `supports_transcription_only`

Declare supported languages and capabilities:

- The `supported_languages` mapping is validated at init time.
- Set `supports_transcription_only=True` if the model should not serve text generation (e.g., Whisper).

??? code "supported_languages and supports_transcription_only"

    ```python
    from typing import ClassVar, Mapping, Literal

    import numpy as np
    import torch
    from torch import nn

    from vllm.config import ModelConfig, SpeechToTextConfig
    from vllm.inputs import PromptType
    from vllm.model_executor.models.interfaces import SupportsTranscription

    class YourASRModel(nn.Module, SupportsTranscription):
        # Map of ISO 639-1 language codes to language names
        supported_languages: ClassVar[Mapping[str, str]] = {
            "en": "English",
            "it": "Italian",
            # ... add more as needed
        }

        # If your model only supports audio-conditioned generation
        # (no text-only generation), enable this flag.
        supports_transcription_only: ClassVar[bool] = True
    ```

Provide an ASR configuration via [get_speech_to_text_config][vllm.model_executor.models.interfaces.SupportsTranscription.get_speech_to_text_config].

This configuration controls the general behavior of the API when serving your model:

??? code "get_speech_to_text_config()"

    ```python
    class YourASRModel(nn.Module, SupportsTranscription):
        ...

        @classmethod
        def get_speech_to_text_config(
            cls,
            model_config: ModelConfig,
            task_type: Literal["transcribe", "translate"],
        ) -> SpeechToTextConfig:
            return SpeechToTextConfig(
                sample_rate=16_000,
                max_audio_clip_s=30,
                # Set to None to disable server-side chunking if your
                # model/processor handles it already
                min_energy_split_window_size=None,
            )
    ```

See [Audio preprocessing and chunking](#audio-preprocessing-and-chunking) for what each field controls.

Implement the prompt construction via [get_generation_prompt][vllm.model_executor.models.interfaces.SupportsTranscription.get_generation_prompt]. The server builds a [SpeechToTextParams][vllm.config.speech_to_text.SpeechToTextParams] object that bundles the resampled waveform, task parameters, and request-specific options. Your model receives this single object and returns a valid [PromptType][vllm.inputs.llm.PromptType]. There are two common patterns:

#### Multimodal LLM with audio embeddings (e.g., Voxtral, Gemma3n)

Return a dict containing `multi_modal_data` with the audio, and either a `prompt` string or `prompt_token_ids`:

??? code "get_generation_prompt()"

    ```python
    from vllm.config.speech_to_text import SpeechToTextParams

    class YourASRModel(nn.Module, SupportsTranscription):
        ...

        @classmethod
        def get_generation_prompt(
            cls,
            stt_params: SpeechToTextParams,
        ) -> PromptType:
            audio = stt_params.audio
            stt_config = stt_params.stt_config
            task_type = stt_params.task_type

            task_word = "Transcribe" if task_type == "transcribe" else "Translate"
            prompt = (
                "<start_of_turn>user\n"
                f"{task_word} this audio: <audio_soft_token>"
                "<end_of_turn>\n<start_of_turn>model\n"
            )

            return {
                "multi_modal_data": {"audio": (audio, stt_config.sample_rate)},
                "prompt": prompt,
            }
    ```

    For more details on multimodal inputs, see [Multi-Modal Inputs](../../features/multimodal_inputs.md).

#### Encoder–decoder audio-only (e.g., Whisper)

Return a dict with separate `encoder_prompt` and `decoder_prompt` entries:

??? code "get_generation_prompt()"

    ```python
    from typing import cast

    from vllm.config.speech_to_text import SpeechToTextParams

    class YourASRModel(nn.Module, SupportsTranscription):
        ...

        @classmethod
        def get_generation_prompt(
            cls,
            stt_params: SpeechToTextParams,
        ) -> PromptType:
            audio = stt_params.audio
            stt_config = stt_params.stt_config
            language = stt_params.language
            task_type = stt_params.task_type
            request_prompt = stt_params.request_prompt

            if language is None:
                raise ValueError("Language must be specified")

            prompt = {
                "encoder_prompt": {
                    "prompt": "",
                    "multi_modal_data": {
                        "audio": (audio, stt_config.sample_rate),
                    },
                },
                "decoder_prompt": (
                    (f"<|prev|>{request_prompt}" if request_prompt else "")
                    + f"<|startoftranscript|><|{language}|>"
                    + f"<|{task_type}|><|notimestamps|>"
                ),
            }
            return cast(PromptType, prompt)
    ```
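
To make the decoder-prompt concatenation concrete, the following standalone sketch reproduces the same string assembly outside of vLLM. `build_decoder_prompt` is a hypothetical helper name introduced only for illustration; the special tokens are the ones used in the Whisper-style example in this section.

```python
def build_decoder_prompt(
    language: str,
    task_type: str,
    request_prompt: str = "",
) -> str:
    """Assemble a Whisper-style decoder prompt (illustrative helper only)."""
    # Optional user-supplied context goes first, behind the <|prev|> marker.
    prefix = f"<|prev|>{request_prompt}" if request_prompt else ""
    return (
        prefix
        + f"<|startoftranscript|><|{language}|>"
        + f"<|{task_type}|><|notimestamps|>"
    )


print(build_decoder_prompt("en", "transcribe"))
# <|startoftranscript|><|en|><|transcribe|><|notimestamps|>
print(build_decoder_prompt("it", "translate", "Meeting notes:"))
# <|prev|>Meeting notes:<|startoftranscript|><|it|><|translate|><|notimestamps|>
```

Keeping the prompt assembly in a small pure function like this can make it easier to unit-test the token order independently of the model.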

### `validate_language` (optional)

Language validation is handled by [validate_language][vllm.model_executor.models.interfaces.SupportsTranscription.validate_language].

If your model requires a language and you want a default, override this method (see Whisper):

??? code "validate_language()"

    ```python
    @classmethod
    def validate_language(cls, language: str | None) -> str | None:
        if language is None:
            logger.warning(
                "Defaulting to language='en'. If you wish to transcribe "
                "audio in a different language, pass the `language` field "
                "in the TranscriptionRequest."
            )
            language = "en"
        return super().validate_language(language)
    ```
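
The override pattern can be tried in isolation with the sketch below. It does not import vLLM: `BaseASR` is a hypothetical stand-in for the interface, and its behavior (passing `None` through and rejecting codes missing from `supported_languages`) is an assumption made for this example rather than a statement about the real base class.

```python
import logging
from collections.abc import Mapping
from typing import ClassVar, Optional

logger = logging.getLogger(__name__)


class BaseASR:
    """Hypothetical stand-in for the interface; not vLLM's actual class."""

    supported_languages: ClassVar[Mapping[str, str]] = {}

    @classmethod
    def validate_language(cls, language: Optional[str]) -> Optional[str]:
        # Assumed behavior: None passes through, unknown codes are rejected.
        if language is not None and language not in cls.supported_languages:
            raise ValueError(f"Unsupported language: {language!r}")
        return language


class DefaultingASR(BaseASR):
    """Model that requires a language and falls back to English."""

    supported_languages = {"en": "English", "it": "Italian"}

    @classmethod
    def validate_language(cls, language: Optional[str]) -> Optional[str]:
        if language is None:
            logger.warning("Defaulting to language='en'.")
            language = "en"
        # Delegate the membership check to the base implementation.
        return super().validate_language(language)


print(DefaultingASR.validate_language(None))  # en
print(DefaultingASR.validate_language("it"))  # it
```

Applying the default before delegating to `super()` means the fallback value goes through the same check as any user-supplied code.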

### `get_num_audio_tokens` (optional)

Token accounting for streaming is provided by [get_num_audio_tokens][vllm.model_executor.models.interfaces.SupportsTranscription.get_num_audio_tokens].

Provide a fast duration→token estimate to improve streaming usage statistics:

??? code "get_num_audio_tokens()"

    ```python
    class YourASRModel(nn.Module, SupportsTranscription):
        ...

        @classmethod
        def get_num_audio_tokens(
            cls,
            audio_duration_s: float,
            stt_config: SpeechToTextConfig,
            model_config: ModelConfig,
        ) -> int | None:
            # Return None if unknown; otherwise return an estimate.
            return int(audio_duration_s * stt_config.sample_rate // 320)  # example
    ```
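
As a worked example of the arithmetic, the sketch below applies the same formula as a free function. The helper name `estimate_audio_tokens` and the parameter `samples_per_token` are hypothetical; the value 320 is only the placeholder constant from the example in this section, and the right value for a real model depends on its feature extractor and any downsampling in the encoder.

```python
def estimate_audio_tokens(
    audio_duration_s: float,
    sample_rate: int,
    samples_per_token: int = 320,  # placeholder constant from the example
) -> int:
    """Estimate the number of audio tokens for a clip of the given duration."""
    # Total samples in the clip, divided by the samples each token covers.
    return int(audio_duration_s * sample_rate // samples_per_token)


print(estimate_audio_tokens(30.0, 16_000))  # 1500
print(estimate_audio_tokens(2.5, 16_000))   # 125
```

A 30-second clip at 16 kHz contains 480,000 samples, so with 320 samples per token the estimate is 1,500 tokens.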

## Audio preprocessing and chunking

The API server takes care of basic audio I/O and optional chunking before building prompts:

- Resampling: Input audio is resampled to `SpeechToTextConfig.sample_rate` using `AudioResampler`.
- Chunking: If `SpeechToTextConfig.allow_audio_chunking` is True and the duration exceeds `max_audio_clip_s`, the server splits the audio into overlapping chunks and generates a prompt per chunk. Overlap is controlled by `overlap_chunk_second`.
- Energy-aware splitting: When `min_energy_split_window_size` is set, the server finds low-energy regions to minimize cutting within words.
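
The idea behind energy-aware splitting can be illustrated with a simplified, pure-Python sketch. This is not the server's implementation: the function names and parameters are hypothetical, and the sketch assumes the overlap window is used only as the search region for the cut, so the resulting chunks are contiguous rather than overlapping.

```python
from typing import List, Sequence


def find_quietest_split(
    samples: Sequence[float], start: int, end: int, window: int
) -> int:
    """Return the start index of the lowest-energy window in [start, end)."""
    best_idx, best_energy = start, float("inf")
    for i in range(start, max(start + 1, end - window + 1), window):
        frame = samples[i : i + window]
        energy = sum(x * x for x in frame) / len(frame)  # mean squared amplitude
        if energy < best_energy:
            best_idx, best_energy = i, energy
    return best_idx


def split_audio(
    samples: Sequence[float],
    sample_rate: int,
    max_clip_s: float,
    overlap_s: float,
    window: int,
) -> List[Sequence[float]]:
    """Cut audio into chunks of at most max_clip_s, preferring quiet cut points."""
    if window <= 0 or not 0 < overlap_s < max_clip_s:
        raise ValueError("need window > 0 and 0 < overlap_s < max_clip_s")
    chunk_size = int(sample_rate * max_clip_s)
    overlap_size = int(sample_rate * overlap_s)
    chunks: List[Sequence[float]] = []
    i = 0
    while i < len(samples):
        if i + chunk_size >= len(samples):
            chunks.append(samples[i:])  # the remainder fits in one chunk
            break
        # Search the tail of the current chunk for the quietest window.
        split = find_quietest_split(
            samples, i + chunk_size - overlap_size, i + chunk_size, window
        )
        chunks.append(samples[i:split])
        i = split
    return chunks


# Toy signal: constant amplitude with two short silent gaps.
audio = [1.0] * 30
for idx in (10, 11, 18, 19):
    audio[idx] = 0.0

chunks = split_audio(audio, sample_rate=4, max_clip_s=3, overlap_s=1, window=2)
print([len(c) for c in chunks])  # [10, 8, 12]
```

In the toy signal, both cuts land at the start of a silent gap (indices 10 and 18) instead of at the hard 12-sample boundary.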

Relevant server logic:

??? code "_preprocess_speech_to_text()"

    ```python
    # vllm/entrypoints/openai/speech_to_text.py
    async def _preprocess_speech_to_text(...):
        language = self.model_cls.validate_language(request.language)
        ...
        y, sr = load_audio(bytes_, sr=self.asr_config.sample_rate)
        duration = get_audio_duration(y=y, sr=sr)
        do_split_audio = (self.asr_config.allow_audio_chunking
                          and duration > self.asr_config.max_audio_clip_s)
        chunks = [y] if not do_split_audio else self._split_audio(y, int(sr))
        prompts = []
        for chunk in chunks:
            stt_params = request.build_stt_params(
                audio=chunk,
                stt_config=self.asr_config,
                model_config=self.model_config,
                task_type=self.task_type,
            )
            prompt = self.model_cls.get_generation_prompt(stt_params)
            prompts.append(prompt)
        return prompts, duration
    ```

## Exposing tasks automatically

vLLM automatically advertises transcription support if your model implements the interface:

```python
if supports_transcription(model):
    if model.supports_transcription_only:
        return ["transcription"]
    supported_tasks.append("transcription")
```
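
The effect of the `supports_transcription_only` flag on the advertised tasks can be seen in the self-contained sketch below. Everything in it is hypothetical scaffolding for illustration: the three model classes, the `hasattr`-based `supports_transcription` stand-in, and the default task name `"generate"` are assumptions and do not reflect how vLLM performs the interface check.

```python
from typing import List


class TextOnlyModel:
    """Hypothetical model without any transcription support."""


class WhisperLikeModel:
    """Hypothetical audio-only model."""

    supports_transcription_only = True


class AudioLLM:
    """Hypothetical LLM that handles both text generation and audio."""

    supports_transcription_only = False


def supports_transcription(model: object) -> bool:
    # Stand-in check for this sketch only.
    return hasattr(model, "supports_transcription_only")


def get_supported_tasks(model: object) -> List[str]:
    supported_tasks = ["generate"]  # assumed default task name
    if supports_transcription(model):
        if model.supports_transcription_only:
            # Audio-only models advertise transcription and nothing else.
            return ["transcription"]
        supported_tasks.append("transcription")
    return supported_tasks


print(get_supported_tasks(TextOnlyModel()))     # ['generate']
print(get_supported_tasks(WhisperLikeModel()))  # ['transcription']
print(get_supported_tasks(AudioLLM()))          # ['generate', 'transcription']
```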

When enabled, the server initializes the transcription and translation handlers:

```python
state.openai_serving_transcription = OpenAIServingTranscription(...) if "transcription" in supported_tasks else None
state.openai_serving_translation = OpenAIServingTranslation(...) if "transcription" in supported_tasks else None
```

No extra registration is required beyond having your model class available via the model registry and implementing `SupportsTranscription`.

## Examples in-tree

- Whisper encoder–decoder (audio-only): [vllm/model_executor/models/whisper.py](../../../vllm/model_executor/models/whisper.py)
- Voxtral decoder-only (audio embeddings + LLM): [vllm/model_executor/models/voxtral.py](../../../vllm/model_executor/models/voxtral.py). Make sure `mistral-common[audio]` is installed.
- Gemma3n decoder-only with fixed instruction prompt: [vllm/model_executor/models/gemma3n_mm.py](../../../vllm/model_executor/models/gemma3n_mm.py)
- Qwen3-Omni multimodal with audio embeddings: [vllm/model_executor/models/qwen3_omni_moe_thinker.py](../../../vllm/model_executor/models/qwen3_omni_moe_thinker.py)

## Test with the API

Once your model implements `SupportsTranscription`, you can test the endpoints (the API mimics OpenAI's):

- Transcription (ASR):

    ```bash
    curl -s -X POST \
      -H "Authorization: Bearer $VLLM_API_KEY" \
      -H "Content-Type: multipart/form-data" \
      -F "file=@/path/to/audio.wav" \
      -F "model=$MODEL_ID" \
      http://localhost:8000/v1/audio/transcriptions
    ```

- Translation (source → English unless otherwise supported):

    ```bash
    curl -s -X POST \
      -H "Authorization: Bearer $VLLM_API_KEY" \
      -H "Content-Type: multipart/form-data" \
      -F "file=@/path/to/audio.wav" \
      -F "model=$MODEL_ID" \
      http://localhost:8000/v1/audio/translations
    ```

Or check out more examples in [examples/online_serving](../../../examples/online_serving).

!!! note
    - If your model handles chunking internally (e.g., via its processor or encoder), set `min_energy_split_window_size=None` in the returned `SpeechToTextConfig` to disable server-side chunking.
    - Implementing `get_num_audio_tokens` improves accuracy of streaming usage metrics (`prompt_tokens`) without an extra forward pass.
    - For multilingual behavior, keep `supported_languages` aligned with actual model capabilities.