# OpenAI-Compatible Server

vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.

In your terminal, you can [install](../getting_started/installation/README.md) vLLM, then start the server with the [`vllm serve`](../configuration/serve_args.md) command. (You can also use our [Docker](../deployment/docker.md) image.)

```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
  --dtype auto \
  --api-key token-abc123
```

To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the [official OpenAI Python client](https://github.com/openai/openai-python).

??? code

    ```python
    from openai import OpenAI
    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="token-abc123",
    )

    completion = client.chat.completions.create(
        model="NousResearch/Meta-Llama-3-8B-Instruct",
        messages=[
            {"role": "user", "content": "Hello!"},
        ],
    )

    print(completion.choices[0].message)
    ```

!!! tip
    vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
    You can pass these parameters to vLLM through the OpenAI client's `extra_body` argument, e.g. `extra_body={"top_k": 50}` for `top_k`.

!!! important
    By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.

    To disable this behavior, please pass `--generation-config vllm` when launching the server.

## Supported APIs

We currently support the following OpenAI APIs:

- [Completions API](#completions-api) (`/v1/completions`)
    - Only applicable to [text generation models](../models/generative_models.md).
    - *Note: `suffix` parameter is not supported.*
- [Responses API](#responses-api) (`/v1/responses`)
    - Only applicable to [text generation models](../models/generative_models.md).
- [Chat Completions API](#chat-api) (`/v1/chat/completions`)
    - Only applicable to [text generation models](../models/generative_models.md) with a [chat template](../serving/openai_compatible_server.md#chat-template).
    - *Note: `user` parameter is ignored.*
    - *Note:* Setting the `parallel_tool_calls` parameter to `false` ensures vLLM only returns zero or one tool call per request. Setting it to `true` (the default) allows returning more than one tool call per request. There is no guarantee more than one tool call will be returned if this is set to `true`, as that behavior is model dependent and not all models are designed to support parallel tool calls.
- [Embeddings API](../models/pooling_models/embed.md#openai-compatible-embeddings-api) (`/v1/embeddings`)
    - Only applicable to [embedding models](../models/pooling_models/embed.md).
- [Transcriptions API](#transcriptions-api) (`/v1/audio/transcriptions`)
    - Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
- [Translations API](#translations-api) (`/v1/audio/translations`)
    - Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
- [Realtime API](#realtime-api) (`/v1/realtime`)
    - Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).

In addition, we have the following custom APIs:

- [Tokenizer API](#tokenizer-api) (`/tokenize`, `/detokenize`)
    - Applicable to any model with a tokenizer.
- [Pooling API](../models/pooling_models/README.md#pooling-api) (`/pooling`)
    - Applicable to all [pooling models](../models/pooling_models/README.md).
- [Classification API](../models/pooling_models/classify.md#classification-api) (`/classify`)
    - Only applicable to [classification models](../models/pooling_models/classify.md).
- [Cohere Embed API](../models/pooling_models/embed.md#cohere-embed-api) (`/v2/embed`)
    - Compatible with [Cohere's Embed API](https://docs.cohere.com/reference/embed)
    - Works with any [embedding model](../models/pooling_models/embed.md#supported-models), including multimodal models.
- [Score API](../models/pooling_models/scoring.md#score-api) (`/score`, `/v1/score`)
    - Applicable to [score models](../models/pooling_models/scoring.md) (cross-encoder, bi-encoder, late-interaction).
- [Generative Scoring API](#generative-scoring-api) (`/generative_scoring`)
    - Applicable to [CausalLM models](../models/generative_models.md) (task `"generate"`).
    - Computes next-token probabilities for specified `label_token_ids`.
- [Rerank API](../models/pooling_models/scoring.md#rerank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
    - Implements [Jina AI's v1 rerank API](https://jina.ai/reranker/)
    - Also compatible with [Cohere's v1 & v2 rerank APIs](https://docs.cohere.com/v2/reference/rerank)
    - Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.

## Chat Template

In order for the language model to support the chat protocol, vLLM requires the model to include
a chat template in its tokenizer configuration. The chat template is a Jinja2 template that
specifies how roles, messages, and other chat-specific tokens are encoded in the input.

An example chat template for `NousResearch/Meta-Llama-3-8B-Instruct` can be found [here](https://llama.com/docs/model-cards-and-prompt-formats/meta-llama-3/#prompt-template-for-meta-llama-3).

Some models do not provide a chat template even though they are instruction/chat fine-tuned. For those models,
you can manually specify their chat template in the `--chat-template` parameter with the file path to the chat
template, or the template in string form. Without a chat template, the server will not be able to process chat
and all chat requests will error.

```bash
vllm serve <model> --chat-template ./path-to-chat-template.jinja
```

The vLLM community provides a set of chat templates for popular models. You can find them under the [examples](../../examples) directory.

With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies
both a `type` and a `text` field. An example is provided below:

```python
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"},
            ],
        },
    ],
)
```

Most chat templates for LLMs expect the `content` field to be a string, but there are some newer models like
`meta-llama/Llama-Guard-3-1B` that expect the content to be formatted according to the OpenAI schema in the
request. vLLM provides best-effort support to detect this automatically, which is logged as a string like
*"Detected the chat template content format to be..."*, and internally converts incoming requests to match
the detected format, which can be one of:

- `"string"`: A string.
    - Example: `"Hello world"`
- `"openai"`: A list of dictionaries, similar to OpenAI schema.
    - Example: `[{"type": "text", "text": "Hello world!"}]`

If the result is not what you expect, you can set the `--chat-template-content-format` CLI argument
to override which format to use.
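To make the two formats concrete, here is a minimal sketch of converting a message's `content` between them. The helper names `to_openai_content` and `to_string_content` are hypothetical and only illustrate the two shapes; this is not vLLM's internal conversion code, and joining text parts with newlines is an assumption made for the example.

```python
from typing import Union

# A content value is either a plain string or a list of typed parts.
Content = Union[str, list[dict]]


def to_openai_content(content: Content) -> list[dict]:
    """Wrap a plain string into the list-of-parts ("openai") shape."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return content


def to_string_content(content: Content) -> str:
    """Flatten text parts into one string (the "string" shape).

    Assumption: text parts are joined with newlines; non-text parts are skipped.
    """
    if isinstance(content, str):
        return content
    return "\n".join(p["text"] for p in content if p.get("type") == "text")


print(to_openai_content("Hello world!"))
print(to_string_content([{"type": "text", "text": "Hello world!"}]))
```

The two helpers are inverses for single-part text content, which is the case the examples above cover.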

## Extra Parameters

vLLM supports a set of parameters that are not part of the OpenAI API.
To use them, pass them as extra parameters in the OpenAI client,
or merge them directly into the JSON payload if you are calling the HTTP API yourself.

```python
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    extra_body={
        "structured_outputs": {"choice": ["positive", "negative"]},
    },
)
```
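Without the OpenAI client, the same request can be sketched with plain HTTP, where the extra parameter sits at the top level of the JSON body rather than inside `extra_body`. The helper name `build_chat_request` is hypothetical, and the URL and API key assume the server started in the example at the top of this page.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat request with a vLLM extra parameter."""
    payload = {
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "messages": [
            {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
        ],
        # Extra parameters are merged directly into the payload,
        # not nested under "extra_body".
        "structured_outputs": {"choice": ["positive", "negative"]},
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request("http://localhost:8000/v1", "token-abc123")
print(request.full_url)
# With the server running, send it with:
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response)["choices"][0]["message"])
```

Building the request separately from sending it keeps the payload easy to inspect before it reaches the server.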

## Extra HTTP Headers

Only the `X-Request-Id` HTTP request header is supported for now. It can be enabled
with `--enable-request-id-headers`.

??? code

    ```python
    completion = client.chat.completions.create(
        model="NousResearch/Meta-Llama-3-8B-Instruct",
        messages=[
            {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
        ],
        extra_headers={
            "x-request-id": "sentiment-classification-00001",
        },
    )
    print(completion._request_id)

    completion = client.completions.create(
        model="NousResearch/Meta-Llama-3-8B-Instruct",
        prompt="A robot may not injure a human being",
        extra_headers={
            "x-request-id": "completion-test",
        },
    )
    print(completion._request_id)
    ```

## Offline API Documentation

The FastAPI `/docs` endpoint requires an internet connection by default. To enable offline access in air-gapped environments, use the `--enable-offline-docs` flag:

```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --enable-offline-docs
```

## API Reference

### Completions API

Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.

Code example: [examples/basic/online_serving/openai_completion_client.py](../../examples/basic/online_serving/openai_completion_client.py)

#### Extra parameters

The following [sampling parameters](../api/README.md#inference-parameters) are supported.

??? code

    ```python
    --8<-- "vllm/entrypoints/openai/completion/protocol.py:completion-sampling-params"
    ```

The following extra parameters are supported:

??? code

    ```python
    --8<-- "vllm/entrypoints/openai/completion/protocol.py:completion-extra-params"
    ```

### Chat API

Our Chat API is compatible with [OpenAI's Chat Completions API](https://platform.openai.com/docs/api-reference/chat);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.

We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
[Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
see our [Multimodal Inputs](../features/multimodal_inputs.md) guide for more information.

- *Note: `image_url.detail` parameter is not supported.*

Code example: [examples/basic/online_serving/openai_chat_completion_client.py](../../examples/basic/online_serving/openai_chat_completion_client.py)

#### Extra parameters

The following [sampling parameters](../api/README.md#inference-parameters) are supported.

??? code

    ```python
    --8<-- "vllm/entrypoints/openai/chat_completion/protocol.py:chat-completion-sampling-params"
    ```

The following extra parameters are supported:

??? code

    ```python
    --8<-- "vllm/entrypoints/openai/chat_completion/protocol.py:chat-completion-extra-params"
    ```

### Responses API

Our Responses API is compatible with [OpenAI's Responses API](https://platform.openai.com/docs/api-reference/responses);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.

Code example: [examples/online_serving/openai_responses_client_with_tools.py](../../examples/online_serving/openai_responses_client_with_tools.py)

#### Extra parameters

The following extra parameters in the request object are supported:

??? code

    ```python
    --8<-- "vllm/entrypoints/openai/responses/protocol.py:responses-extra-params"
    ```

The following extra parameters in the response object are supported:

??? code

    ```python
    --8<-- "vllm/entrypoints/openai/responses/protocol.py:responses-response-extra-params"
    ```

### Transcriptions API

Our Transcriptions API is compatible with [OpenAI's Transcriptions API](https://platform.openai.com/docs/api-reference/audio/createTranscription);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.

!!! note
    To use the Transcriptions API, please install with extra audio dependencies using `pip install vllm[audio]`.

Code example: [examples/online_serving/openai_transcription_client.py](../../examples/online_serving/openai_transcription_client.py)

NOTE: Beam search is currently supported in the transcriptions endpoint for encoder-decoder multimodal models (e.g., Whisper), but it is highly inefficient because handling of the encoder/decoder cache is still being worked on. This is an area of active optimization.

#### API Enforced Limits

Set the maximum audio file size (in MB) that vLLM will accept, via the
`VLLM_MAX_AUDIO_CLIP_FILESIZE_MB` environment variable. Default is 25 MB.
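A client can check a file against this limit before uploading. The sketch below is a client-side convenience only, not part of vLLM: the helper name `fits_upload_limit` is hypothetical, it assumes 1 MB means 1024 * 1024 bytes, and it assumes the client knows the limit configured on the server.

```python
import os
import tempfile

DEFAULT_LIMIT_MB = 25  # documented default of VLLM_MAX_AUDIO_CLIP_FILESIZE_MB


def fits_upload_limit(path: str, limit_mb: float = DEFAULT_LIMIT_MB) -> bool:
    """Return True if the file at `path` is no larger than `limit_mb`.

    Assumption: 1 MB is treated as 1024 * 1024 bytes.
    """
    return os.path.getsize(path) <= limit_mb * 1024 * 1024


# Demonstrate with a small temporary file standing in for an audio clip.
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    tmp.write(b"\x00" * 2048)
    demo_path = tmp.name

print(fits_upload_limit(demo_path))                  # 2 KiB file, 25 MB limit
print(fits_upload_limit(demo_path, limit_mb=0.001))  # limit of roughly 1 KiB
os.remove(demo_path)
```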

#### Uploading Audio Files

The Transcriptions API supports uploading audio files in various formats including FLAC, MP3, MP4, MPEG, MPGA, M4A, OGG, WAV, and WEBM.

**Using OpenAI Python Client:**

??? code

    ```python
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="token-abc123",
    )

    # Upload audio file from disk
    with open("audio.mp3", "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="openai/whisper-large-v3-turbo",
            file=audio_file,
            language="en",
            response_format="verbose_json",
        )

    print(transcription.text)
    ```

**Using curl with multipart/form-data:**

??? code

    ```bash
    curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
      -H "Authorization: Bearer token-abc123" \
      -F "file=@audio.mp3" \
      -F "model=openai/whisper-large-v3-turbo" \
      -F "language=en" \
      -F "response_format=verbose_json"
    ```

**Supported Parameters:**

- `file`: The audio file to transcribe (required)
- `model`: The model to use for transcription (required)
- `language`: The language code (e.g., "en", "zh") (optional)
- `prompt`: Text to guide the transcription style (optional)
- `response_format`: Format of the response (e.g., "json", "text", "verbose_json") (optional)
- `temperature`: Sampling temperature between 0 and 1 (optional)

For the complete list of supported parameters including sampling parameters and vLLM extensions, see the [protocol definitions](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/protocol.py#L2182).

**Response Format:**

For `verbose_json` response format:

??? code

    ```json
    {
      "text": "Hello, this is a transcription of the audio file.",
      "language": "en",
      "duration": 5.42,
      "segments": [
        {
          "id": 0,
          "seek": 0,
          "start": 0.0,
          "end": 2.5,
          "text": "Hello, this is a transcription",
          "tokens": [50364, 938, 428, 307, 275, 28347],
          "temperature": 0.0,
          "avg_logprob": -0.245,
          "compression_ratio": 1.235,
          "no_speech_prob": 0.012
        }
      ]
    }
    ```
Currently, the `verbose_json` response format does not support `no_speech_prob`.
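As a rough illustration of consuming a `verbose_json` response, the sketch below parses an abbreviated copy of the example above and prints one line per segment. The helper name `format_segments` is hypothetical, and only fields shown in the example response are used.

```python
import json

# Abbreviated copy of the `verbose_json` example above.
raw = """
{
  "text": "Hello, this is a transcription of the audio file.",
  "language": "en",
  "duration": 5.42,
  "segments": [
    {"id": 0, "start": 0.0, "end": 2.5, "text": "Hello, this is a transcription"}
  ]
}
"""


def format_segments(response: dict) -> list[str]:
    """Render each segment as "[start-end] text" using its timestamps."""
    return [
        f"[{seg['start']:.2f}s-{seg['end']:.2f}s] {seg['text']}"
        for seg in response.get("segments", [])
    ]


lines = format_segments(json.loads(raw))
print("\n".join(lines))
```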

#### Extra Parameters

The following [sampling parameters](../api/README.md#inference-parameters) are supported.

??? code

    ```python
    --8<-- "vllm/entrypoints/openai/speech_to_text/protocol.py:transcription-sampling-params"
    ```

The following extra parameters are supported:

??? code

    ```python
    --8<-- "vllm/entrypoints/openai/speech_to_text/protocol.py:transcription-extra-params"
    ```

### Translations API

Our Translations API is compatible with [OpenAI's Translations API](https://platform.openai.com/docs/api-reference/audio/createTranslation);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
Whisper models can translate audio from one of the 55 non-English supported languages into English.
Note that the popular `openai/whisper-large-v3-turbo` model does not support translation.

!!! note
    To use the Translations API, please install with extra audio dependencies using `pip install vllm[audio]`.

Code example: [examples/online_serving/openai_translation_client.py](../../examples/online_serving/openai_translation_client.py)

#### Extra Parameters

The following [sampling parameters](../api/README.md#inference-parameters) are supported.

```python
--8<-- "vllm/entrypoints/openai/speech_to_text/protocol.py:translation-sampling-params"
```

The following extra parameters are supported:

```python
--8<-- "vllm/entrypoints/openai/speech_to_text/protocol.py:translation-extra-params"
```

### Realtime API

The Realtime API provides WebSocket-based streaming audio transcription, allowing real-time speech-to-text as audio is being recorded.

!!! note
    To use the Realtime API, please install with extra audio dependencies using `uv pip install vllm[audio]`.

#### Audio Format

Audio must be sent as base64-encoded PCM16 audio at 16kHz sample rate, mono channel.

#### Protocol Overview

1. Client connects to `ws://host/v1/realtime`
2. Server sends `session.created` event
3. Client optionally sends `session.update` with model/params
4. Client sends `input_audio_buffer.commit` when ready
5. Client sends `input_audio_buffer.append` events with base64 PCM16 chunks
6. Server sends `transcription.delta` events with incremental text
7. Server sends `transcription.done` with final text + usage
8. Repeat from step 5 for next utterance
9. Optionally, client sends `input_audio_buffer.commit` with `"final": true`
    to signal that audio input is finished. This is useful when streaming audio files.

#### Client → Server Events

| Event | Description |
| ----- | ----------- |
| `input_audio_buffer.append` | Send base64-encoded audio chunk: `{"type": "input_audio_buffer.append", "audio": "<base64>"}` |
| `input_audio_buffer.commit` | Trigger transcription processing or end: `{"type": "input_audio_buffer.commit", "final": bool}` |
| `session.update` | Configure session: `{"type": "session.update", "model": "model-name"}` |

#### Server → Client Events

| Event | Description |
| ----- | ----------- |
| `session.created` | Connection established with session ID and timestamp |
| `transcription.delta` | Incremental transcription text: `{"type": "transcription.delta", "delta": "text"}` |
| `transcription.done` | Final transcription with usage stats |
| `error` | Error notification with message and optional code |
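On the receiving side, a client typically dispatches on the event `type`. The sketch below accumulates `transcription.delta` text until `transcription.done` arrives. The class name `TranscriptCollector` is hypothetical, and the sample messages are simplified stand-ins: real events carry additional fields that are not modeled here.

```python
import json


class TranscriptCollector:
    """Accumulate `transcription.delta` text until `transcription.done`."""

    def __init__(self) -> None:
        self.parts: list[str] = []
        self.done = False
        self.errors: list[dict] = []

    def handle(self, raw_message: str) -> None:
        event = json.loads(raw_message)
        kind = event.get("type")
        if kind == "transcription.delta":
            self.parts.append(event["delta"])
        elif kind == "transcription.done":
            self.done = True
        elif kind == "error":
            self.errors.append(event)
        # Other events, such as `session.created`, need no action here.

    @property
    def text(self) -> str:
        return "".join(self.parts)


collector = TranscriptCollector()
for message in [
    '{"type": "session.created"}',
    '{"type": "transcription.delta", "delta": "Hello"}',
    '{"type": "transcription.delta", "delta": " world"}',
    '{"type": "transcription.done"}',
]:
    collector.handle(message)

print(collector.text, collector.done)
```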

#### Example Clients

- [openai_realtime_client.py](https://github.com/vllm-project/vllm/tree/main/examples/online_serving/openai_realtime_client.py) - Upload and transcribe an audio file
- [openai_realtime_microphone_client.py](https://github.com/vllm-project/vllm/tree/main/examples/online_serving/openai_realtime_microphone_client.py) - Gradio demo for live microphone transcription

### Tokenizer API

Our Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).
It consists of two endpoints:

- `/tokenize` corresponds to calling `tokenizer.encode()`.
- `/detokenize` corresponds to calling `tokenizer.decode()`.
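A minimal sketch of calling the two endpoints over HTTP follows. The request field names (`prompt` for `/tokenize`, `tokens` for `/detokenize`) are assumptions that are not spelled out on this page, so check them against the server's `/docs` endpoint; the helper name `json_post` is hypothetical, and the token IDs are placeholders.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # note: no /v1 prefix for these endpoints
MODEL = "NousResearch/Meta-Llama-3-8B-Instruct"


def json_post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request to the server."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed request fields: "prompt" for /tokenize, "tokens" for /detokenize.
tokenize_request = json_post("/tokenize", {"model": MODEL, "prompt": "Hello!"})
detokenize_request = json_post("/detokenize", {"model": MODEL, "tokens": [1, 2, 3]})

print(tokenize_request.full_url)
print(detokenize_request.full_url)
# With the server running, send either request with urllib.request.urlopen(...).
```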

### Generative Scoring API

The `/generative_scoring` endpoint uses a CausalLM model (e.g., Llama, Qwen, Mistral) to compute the probability of specified token IDs appearing as the next token. Each item (document) is concatenated with the query to form a prompt, and the model predicts how likely each label token is as the next token after that prompt. This lets you score items against a query — for example, asking "Is this the capital of France?" and scoring each city by how likely the model is to answer "Yes".

This endpoint is automatically available when the server is started with a generative model (task `"generate"`). It is separate from the pooling-based [Score API](../models/pooling_models/scoring.md#score-api), which uses cross-encoder, bi-encoder, or late-interaction models.

**Requirements:**

- The `label_token_ids` parameter is **required** and must contain **at least 1 token ID**.
- When 2 label tokens are provided, the score equals `P(label_token_ids[0]) / (P(label_token_ids[0]) + P(label_token_ids[1]))` (softmax over the two labels).
- When more labels are provided, the score is the softmax-normalized probability of the first label token across all label tokens.

#### Example

```bash
curl -X POST http://localhost:8000/generative_scoring \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "query": "Is this city the capital of France?",
    "items": ["Paris", "London", "Berlin"],
    "label_token_ids": [9454, 2753]
  }'
```

Here, each item is appended to the query to form prompts like `"Is this city the capital of France? Paris"`, `"... London"`, etc. The model then predicts the next token, and the score reflects the probability of "Yes" (token 9454) vs "No" (token 2753).

??? console "Response"

    ```json
    {
      "id": "generative-scoring-abc123",
      "object": "list",
      "created": 1234567890,
      "model": "Qwen/Qwen3-0.6B",
      "data": [
        {"index": 0, "object": "score", "score": 0.95},
        {"index": 1, "object": "score", "score": 0.12},
        {"index": 2, "object": "score", "score": 0.08}
      ],
      "usage": {"prompt_tokens": 45, "total_tokens": 48, "completion_tokens": 3}
    }
    ```

#### How it works

1. **Prompt Construction**: For each item, builds `prompt = query + item` (or `item + query` if `item_first=true`)
2. **Forward Pass**: Runs the model on each prompt to get next-token logits
3. **Probability Extraction**: Extracts logprobs for the specified `label_token_ids`
4. **Softmax Normalization**: Applies softmax over only the label tokens (when `apply_softmax=true`)
5. **Score**: Returns the normalized probability of the first label token
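The normalization in steps 3 to 5 can be sketched in a few lines. This illustrates the arithmetic only and is not vLLM's implementation; the function name `generative_score` and the log-probability values are hypothetical.

```python
import math


def generative_score(label_logprobs: list[float]) -> float:
    """Softmax over the label tokens only; return the first label's share."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(label_logprobs)
    weights = [math.exp(lp - peak) for lp in label_logprobs]
    return weights[0] / sum(weights)


# Hypothetical next-token log-probabilities for "Yes" and "No".
yes_logprob, no_logprob = math.log(0.6), math.log(0.2)
score = generative_score([yes_logprob, no_logprob])
print(round(score, 4))  # 0.6 / (0.6 + 0.2) = 0.75
```

Because the softmax runs over the label tokens only, probability mass assigned to any other token is ignored, which is why the two-label score reduces to `P(first) / (P(first) + P(second))`.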

#### Finding Token IDs

To find the token IDs for your labels, use the tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
no_id = tokenizer.encode("No", add_special_tokens=False)[0]
print(f"Yes: {yes_id}, No: {no_id}")
```

## Ray Serve LLM

Ray Serve LLM enables scalable, production-grade serving of the vLLM engine. It integrates tightly with vLLM and extends it with features such as auto-scaling, load balancing, and back-pressure.

Key capabilities:

- Exposes an OpenAI-compatible HTTP API as well as a Pythonic API.
- Scales from a single GPU to a multi-node cluster without code changes.
- Provides observability and autoscaling policies through Ray dashboards and metrics.

The following example shows how to deploy a large model like DeepSeek R1 with Ray Serve LLM: [examples/online_serving/ray_serve_deepseek.py](../../examples/online_serving/ray_serve_deepseek.py).

Learn more about Ray Serve LLM with the official [Ray Serve LLM documentation](https://docs.ray.io/en/latest/serve/llm/index.html).