---
title: OpenAI-Compatible Server
---
[](){ #openai-compatible-server }

vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.

In your terminal, you can [install](../getting_started/installation/README.md) vLLM, then start the server with the [`vllm serve`][serve-args] command. (You can also use our [Docker][deployment-docker] image.)

```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
```

To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the [official OpenAI Python client](https://github.com/openai/openai-python).

```python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(completion.choices[0].message)
```
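
The same request can also be sent without the OpenAI client. The following is a minimal sketch using only the Python standard library. It assumes the server was started as shown above (address `http://localhost:8000`, API key `token-abc123`) and that the key is sent as a bearer token in the `Authorization` header. The helper name `build_chat_request` is made up for this illustration.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, user_message: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Chat Completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the key given to --api-key is passed as a bearer token.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    user_message="Hello!",
)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"])
```

Sending is left commented out so that the snippet can be run without a live server.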

!!! tip
    vLLM supports some parameters that OpenAI does not, such as `top_k`.
    You can pass these parameters through the OpenAI client in the `extra_body` argument of your request, e.g. `extra_body={"top_k": 50}`.

!!! warning
    By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.

    To disable this behavior, please pass `--generation-config vllm` when launching the server.

## Supported APIs

We currently support the following OpenAI APIs:

- [Completions API][completions-api] (`/v1/completions`)
    - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`).
    - *Note: `suffix` parameter is not supported.*
- [Chat Completions API][chat-api] (`/v1/chat/completions`)
    - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`) with a [chat template][chat-template].
    - *Note: `parallel_tool_calls` and `user` parameters are ignored.*
- [Embeddings API][embeddings-api] (`/v1/embeddings`)
    - Only applicable to [embedding models](../models/pooling_models.md) (`--task embed`).
- [Transcriptions API][transcriptions-api] (`/v1/audio/transcriptions`)
    - Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).

In addition, we have the following custom APIs:

- [Tokenizer API][tokenizer-api] (`/tokenize`, `/detokenize`)
    - Applicable to any model with a tokenizer.
- [Pooling API][pooling-api] (`/pooling`)
    - Applicable to all [pooling models](../models/pooling_models.md).
- [Classification API][classification-api] (`/classify`)
    - Only applicable to [classification models](../models/pooling_models.md) (`--task classify`).
- [Score API][score-api] (`/score`)
    - Applicable to embedding models and [cross-encoder models](../models/pooling_models.md) (`--task score`).
- [Re-rank API][rerank-api] (`/rerank`, `/v1/rerank`, `/v2/rerank`)
    - Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
    - Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
    - Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
    - Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).

[](){ #chat-template }

## Chat Template

For a language model to support the chat protocol, vLLM requires the model to include
a chat template in its tokenizer configuration. The chat template is a Jinja2 template that
specifies how roles, messages, and other chat-specific tokens are encoded in the input.

An example chat template for `NousResearch/Meta-Llama-3-8B-Instruct` can be found [here](https://github.com/meta-llama/llama3?tab=readme-ov-file#instruction-tuned-models).

Some models do not provide a chat template even though they are instruction/chat fine-tuned. For those models,
you can manually specify the chat template with the `--chat-template` parameter, passing either the file path to the
template or the template in string form. Without a chat template, the server cannot process chat requests,
and all of them will return an error.

```bash
vllm serve <model> --chat-template ./path-to-chat-template.jinja
```

The vLLM community provides a set of chat templates for popular models. You can find them under the <gh-dir:examples> directory.

With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies
both a `type` and a `text` field. An example is provided below:

```python
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": [{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"}]}
    ]
)
```

Most chat templates for LLMs expect the `content` field to be a string, but there are some newer models like
`meta-llama/Llama-Guard-3-1B` that expect the content to be formatted according to the OpenAI schema in the
request. vLLM provides best-effort support to detect this automatically, which is logged as a string like
*"Detected the chat template content format to be..."*, and internally converts incoming requests to match
the detected format, which can be one of:

- `"string"`: A string.
    - Example: `"Hello world"`
- `"openai"`: A list of dictionaries, similar to OpenAI schema.
    - Example: `[{"type": "text", "text": "Hello world!"}]`

If the result is not what you expect, you can set the `--chat-template-content-format` CLI argument
to override which format to use.
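
To make the two content formats concrete, here is a rough sketch of converting message content between them. It only illustrates the idea and is not vLLM's internal implementation. The helper names are hypothetical, and joining multiple text parts with a newline is an assumption of this sketch.

```python
from typing import Union

# The "openai" format: a list of typed parts.
ContentParts = list[dict[str, str]]


def to_openai_format(content: Union[str, ContentParts]) -> ContentParts:
    """Wrap a plain string into a single text part; leave lists untouched."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return content


def to_string_format(content: Union[str, ContentParts]) -> str:
    """Flatten text parts into one string; leave plain strings untouched."""
    if isinstance(content, str):
        return content
    # Assumption: text parts are joined with a newline; non-text parts are skipped.
    return "\n".join(part["text"] for part in content if part["type"] == "text")


print(to_openai_format("Hello world"))
print(to_string_format([{"type": "text", "text": "Hello world!"}]))
```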

## Extra Parameters

vLLM supports a set of parameters that are not part of the OpenAI API.
To use them, pass them as extra parameters in the OpenAI client,
or merge them directly into the JSON payload if you are calling the HTTP API yourself.

```python
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_body={
        "guided_choice": ["positive", "negative"]
    }
)
```
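
When calling the HTTP API directly, there is no `extra_body` wrapper: the extra parameter sits next to `model` and `messages` in the request JSON. A minimal sketch of building such a payload follows; the `build_payload` helper is hypothetical and only shows the shape of the merged body.

```python
import json


def build_payload(model: str, messages: list[dict], **extra_params) -> dict:
    """Merge vLLM-specific parameters into the top level of the request body."""
    payload = {"model": model, "messages": messages}
    payload.update(extra_params)
    return payload


payload = build_payload(
    "NousResearch/Meta-Llama-3-8B-Instruct",
    [{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}],
    guided_choice=["positive", "negative"],
)

# This JSON string is what would be sent as the POST body.
print(json.dumps(payload, indent=2))
```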

## Extra HTTP Headers

Only the `X-Request-Id` HTTP request header is supported for now. It can be enabled
with `--enable-request-id-headers`.

> Note that enablement of the headers can impact performance significantly at high QPS
> rates. We recommend implementing HTTP headers at the router level (e.g. via Istio),
> rather than within the vLLM layer for this reason.
> See [this PR](https://github.com/vllm-project/vllm/pull/11529) for more details.

```python
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_headers={
        "x-request-id": "sentiment-classification-00001",
    }
)
print(completion._request_id)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    extra_headers={
        "x-request-id": "completion-test",
    }
)
print(completion._request_id)
```

## API Reference

[](){ #completions-api }

### Completions API

Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.

Code example: <gh-file:examples/online_serving/openai_completion_client.py>

#### Extra parameters

The following [sampling parameters][sampling-params] are supported.

```python
--8<-- "vllm/entrypoints/openai/protocol.py:completion-sampling-params"
```

The following extra parameters are supported:

```python
--8<-- "vllm/entrypoints/openai/protocol.py:completion-extra-params"
```

[](){ #chat-api }

### Chat API

Our Chat API is compatible with [OpenAI's Chat Completions API](https://platform.openai.com/docs/api-reference/chat);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.

We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
[Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
see our [Multimodal Inputs][multimodal-inputs] guide for more information.
- *Note: `image_url.detail` parameter is not supported.*

Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>

#### Extra parameters

The following [sampling parameters][sampling-params] are supported.

```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-sampling-params"
```

The following extra parameters are supported:

```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-extra-params"
```

[](){ #embeddings-api }

### Embeddings API

Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.

If the model has a [chat template][chat-template], you can replace `input` with a list of `messages` (same schema as [Chat API][chat-api]),
which will be treated as a single prompt to the model.

Code example: <gh-file:examples/online_serving/openai_embedding_client.py>

#### Multi-modal inputs

You can pass multi-modal inputs to embedding models by defining a custom chat template for the server
and passing a list of `messages` in the request. Refer to the examples below for illustration.

=== "VLM2Vec"

    To serve the model:

    ```bash
    vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
      --trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
    ```

    !!! warning
        Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
        to run this model in embedding mode instead of text generation mode.

        The custom chat template is completely different from the original one for this model,
        and can be found here: <gh-file:examples/template_vlm2vec.jinja>

    Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level `requests` library:

    ```python
    import requests

    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

    response = requests.post(
        "http://localhost:8000/v1/embeddings",
        json={
            "model": "TIGER-Lab/VLM2Vec-Full",
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": "Represent the given image."},
                ],
            }],
            "encoding_format": "float",
        },
    )
    response.raise_for_status()
    response_json = response.json()
    print("Embedding output:", response_json["data"][0]["embedding"])
    ```

=== "DSE-Qwen2-MRL"

    To serve the model:

    ```bash
    vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
      --trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
    ```

    !!! warning
        Like with VLM2Vec, we have to explicitly pass `--task embed`.

        Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
        by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>

    !!! warning
        `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
        example below for details.

Full example: <gh-file:examples/online_serving/openai_chat_embedding_client_for_multimodal.py>

#### Extra parameters

The following [pooling parameters][pooling-params] are supported.

```python
--8<-- "vllm/entrypoints/openai/protocol.py:embedding-pooling-params"
```

The following extra parameters are supported by default:

```python
--8<-- "vllm/entrypoints/openai/protocol.py:embedding-extra-params"
```

For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead:

```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-embedding-extra-params"
```

[](){ #transcriptions-api }

### Transcriptions API

Our Transcriptions API is compatible with [OpenAI's Transcriptions API](https://platform.openai.com/docs/api-reference/audio/createTranscription);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.

!!! note
    To use the Transcriptions API, please install with extra audio dependencies using `pip install vllm[audio]`.

Code example: <gh-file:examples/online_serving/openai_transcription_client.py>
<!-- TODO: api enforced limits + uploading audios -->

#### Extra parameters

The following [sampling parameters][sampling-params] are supported.

```python
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-sampling-params"
```

The following extra parameters are supported:

```python
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-extra-params"
```

[](){ #tokenizer-api }

### Tokenizer API

Our Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).
It consists of two endpoints:

- `/tokenize` corresponds to calling `tokenizer.encode()`.
- `/detokenize` corresponds to calling `tokenizer.decode()`.
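
As a rough sketch, the two endpoints could be called with JSON bodies like the ones built below. The field names (`prompt` for `/tokenize`, `tokens` for `/detokenize`), the base URL, and the helper `build_json_post` are assumptions made for this illustration, and the token IDs are arbitrary placeholders. Check the request schema of your vLLM version before relying on them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: same address as the earlier examples
MODEL = "NousResearch/Meta-Llama-3-8B-Instruct"


def build_json_post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given endpoint path."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed request shape: text in, token IDs out.
tokenize_request = build_json_post("/tokenize", {"model": MODEL, "prompt": "Hello world"})

# Assumed request shape: token IDs in, text out. The IDs are placeholders.
detokenize_request = build_json_post("/detokenize", {"model": MODEL, "tokens": [1, 2, 3]})

# To send against a running server:
# with urllib.request.urlopen(tokenize_request) as response:
#     print(json.load(response))
```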

[](){ #pooling-api }

### Pooling API

Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states.

The input format is the same as [Embeddings API][embeddings-api], but the output data can contain an arbitrary nested list, not just a 1-D list of floats.

Code example: <gh-file:examples/online_serving/openai_pooling_client.py>

[](){ #classification-api }

### Classification API

Our Classification API directly supports Hugging Face sequence-classification models such as [ai21labs/Jamba-tiny-reward-dev](https://huggingface.co/ai21labs/Jamba-tiny-reward-dev) and [jason9693/Qwen2.5-1.5B-apeach](https://huggingface.co/jason9693/Qwen2.5-1.5B-apeach).

We automatically wrap any other transformer via `as_classification_model()`, which pools on the last token, attaches a `RowParallelLinear` head, and applies a softmax to produce per-class probabilities.

Code example: <gh-file:examples/online_serving/openai_classification_client.py>
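
The final softmax step mentioned above can be illustrated in a few lines of plain Python. This is a sketch of the math only, not vLLM's implementation, and the input logits are made-up values.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Turn raw per-class scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable.
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up logits for a two-class head.
probs = softmax([0.2, 1.3])
print(probs, sum(probs))
```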

#### Example Requests

You can classify multiple texts by passing an array of strings:

Request:

```bash
curl -v "http://127.0.0.1:8000/classify" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jason9693/Qwen2.5-1.5B-apeach",
    "input": [
      "Loved the new café—coffee was great.",
      "This update broke everything. Frustrating."
    ]
  }'
```

Response:

```json
{
  "id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
  "object": "list",
  "created": 1745383065,
  "model": "jason9693/Qwen2.5-1.5B-apeach",
  "data": [
    {
      "index": 0,
      "label": "Default",
      "probs": [
        0.565970778465271,
        0.4340292513370514
      ],
      "num_classes": 2
    },
    {
      "index": 1,
      "label": "Spoiled",
      "probs": [
        0.26448777318000793,
        0.7355121970176697
      ],
      "num_classes": 2
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "total_tokens": 20,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}
```

You can also pass a string directly to the `input` field:

Request:

```bash
curl -v "http://127.0.0.1:8000/classify" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jason9693/Qwen2.5-1.5B-apeach",
    "input": "Loved the new café—coffee was great."
  }'
```

Response:

```json
{
  "id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
  "object": "list",
  "created": 1745383213,
  "model": "jason9693/Qwen2.5-1.5B-apeach",
  "data": [
    {
      "index": 0,
      "label": "Default",
      "probs": [
        0.565970778465271,
        0.4340292513370514
      ],
      "num_classes": 2
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 10,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}
```
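
On the client side, each entry in `data` can be reduced to its most likely class. The sketch below reuses the values from the first response above; the `top_prediction` helper is hypothetical, and it assumes that `probs` is ordered by class index.

```python
def top_prediction(item: dict) -> tuple[str, int, float]:
    """Return the reported label, the index of the largest probability, and that probability."""
    probs = item["probs"]
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return item["label"], best_index, probs[best_index]


# The "data" entries copied from the example response above.
data = [
    {"index": 0, "label": "Default", "probs": [0.565970778465271, 0.4340292513370514], "num_classes": 2},
    {"index": 1, "label": "Spoiled", "probs": [0.26448777318000793, 0.7355121970176697], "num_classes": 2},
]

for item in data:
    label, class_index, probability = top_prediction(item)
    print(f"input {item['index']}: {label} (class {class_index}, p={probability:.3f})")
```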

#### Extra parameters

The following [pooling parameters][pooling-params] are supported.

```python
--8<-- "vllm/entrypoints/openai/protocol.py:classification-pooling-params"
```

The following extra parameters are supported:

```python
--8<-- "vllm/entrypoints/openai/protocol.py:classification-extra-params"
```

[](){ #score-api }

### Score API

Our Score API can apply a cross-encoder model or an embedding model to predict scores for sentence pairs. When using an embedding model, the score corresponds to the cosine similarity between each embedding pair.
Usually, the score for a sentence pair refers to the similarity between two sentences, on a scale of 0 to 1.

You can find the documentation for cross-encoder models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).

Code example: <gh-file:examples/online_serving/openai_cross_encoder_score.py>
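
For the embedding-model case, the cosine similarity mentioned above can be written out as follows. This is a plain-Python sketch of the formula with toy vectors, not the code path vLLM uses.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy embeddings: identical direction, then orthogonal.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```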

#### Single inference

You can pass a string to both `text_1` and `text_2`, forming a single sentence pair.

Request:

```bash
curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "text_1": "What is the capital of France?",
  "text_2": "The capital of France is Paris."
}'
```

Response:

```json
{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}
```

#### Batch inference

You can pass a string to `text_1` and a list to `text_2`, forming multiple sentence pairs
where each pair is built from `text_1` and a string in `text_2`.
The total number of pairs is `len(text_2)`.

Request:

```bash
curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "text_1": "What is the capital of France?",
  "text_2": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'
```

Response:

```json
{
  "id": "score-request-id",
  "object": "list",
  "created": 693570,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 0.001094818115234375
    },
    {
      "index": 1,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}
```

You can pass a list to both `text_1` and `text_2`, forming multiple sentence pairs
where each pair is built from a string in `text_1` and the corresponding string in `text_2` (similar to `zip()`).
The total number of pairs is `len(text_2)`.

Request:

```bash
curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "text_1": [
    "What is the capital of Brazil?",
    "What is the capital of France?"
  ],
  "text_2": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'
```

Response:

```json
{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 1
    },
    {
      "index": 1,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}
```
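
The pairing rules for the three request shapes above (string and string, string and list, list and list) can be summarized in a short sketch. The `build_pairs` helper is hypothetical and only mirrors the behavior described in this section. Rejecting other combinations and mismatched list lengths is a choice of this sketch, not a statement about how the server responds.

```python
from typing import Union

TextInput = Union[str, list[str]]


def build_pairs(text_1: TextInput, text_2: TextInput) -> list[tuple[str, str]]:
    """Form sentence pairs the way the Score API examples describe."""
    if isinstance(text_1, str) and isinstance(text_2, str):
        # Single inference: exactly one pair.
        return [(text_1, text_2)]
    if isinstance(text_1, str):
        # One-to-many: text_1 is paired with every entry of text_2.
        return [(text_1, candidate) for candidate in text_2]
    if isinstance(text_2, str):
        raise ValueError("a list for text_1 with a single string for text_2 is not covered by this sketch")
    if len(text_1) != len(text_2):
        raise ValueError("text_1 and text_2 must have the same length when both are lists")
    # Many-to-many: element-wise pairing, like zip().
    return list(zip(text_1, text_2))


pairs = build_pairs(
    ["What is the capital of Brazil?", "What is the capital of France?"],
    ["The capital of Brazil is Brasilia.", "The capital of France is Paris."],
)
print(len(pairs), pairs[0])
```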

#### Extra parameters

The following [pooling parameters][pooling-params] are supported.

```python
--8<-- "vllm/entrypoints/openai/protocol.py:score-pooling-params"
```

The following extra parameters are supported:

```python
--8<-- "vllm/entrypoints/openai/protocol.py:score-extra-params"
```

[](){ #rerank-api }

### Re-rank API

Our Re-rank API can apply an embedding model or a cross-encoder model to predict relevance scores between a single query and
each document in a list. Usually, the score for a sentence pair refers to the similarity between two sentences, on
a scale of 0 to 1.

You can find the documentation for cross-encoder models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).

The rerank endpoints support popular re-rank models such as `BAAI/bge-reranker-base` and other models supporting the
`score` task. Additionally, `/rerank`, `/v1/rerank`, and `/v2/rerank`
endpoints are compatible with both [Jina AI's re-rank API interface](https://jina.ai/reranker/) and
[Cohere's re-rank API interface](https://docs.cohere.com/v2/reference/rerank) to ensure compatibility with
popular open-source tools.

Code example: <gh-file:examples/online_serving/jinaai_rerank_client.py>

#### Example Request

Note that the `top_n` request parameter is optional and will default to the length of the `documents` field.
Result documents will be sorted by relevance, and the `index` property can be used to determine original order.

Request:

```bash
curl -X 'POST' \
  'http://127.0.0.1:8000/v1/rerank' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-base",
  "query": "What is the capital of France?",
  "documents": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris.",
    "Horses and cows are both animals"
  ]
}'
```

Response:

```json
{
  "id": "rerank-fae51b2b664d4ed38f5969b612edff77",
  "model": "BAAI/bge-reranker-base",
  "usage": {
    "total_tokens": 56
  },
  "results": [
    {
      "index": 1,
      "document": {
        "text": "The capital of France is Paris."
      },
      "relevance_score": 0.99853515625
    },
    {
      "index": 0,
      "document": {
        "text": "The capital of Brazil is Brasilia."
      },
      "relevance_score": 0.0005860328674316406
    }
  ]
}
```
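
Because results arrive sorted by relevance, a client that needs the scores in the order of the submitted `documents` can use `index` to map them back. A minimal sketch, using the two result entries shown in the response above:

```python
# The "results" entries copied from the example response above.
results = [
    {
        "index": 1,
        "document": {"text": "The capital of France is Paris."},
        "relevance_score": 0.99853515625,
    },
    {
        "index": 0,
        "document": {"text": "The capital of Brazil is Brasilia."},
        "relevance_score": 0.0005860328674316406,
    },
]

# Restore the order in which the documents were submitted.
in_original_order = sorted(results, key=lambda result: result["index"])

# Or look up a score by the position of the document in the request.
scores_by_index = {result["index"]: result["relevance_score"] for result in results}

print([result["document"]["text"] for result in in_original_order])
print(scores_by_index[1])
```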

#### Extra parameters

The following [pooling parameters][pooling-params] are supported.

```python
--8<-- "vllm/entrypoints/openai/protocol.py:rerank-pooling-params"
```

The following extra parameters are supported:

```python
--8<-- "vllm/entrypoints/openai/protocol.py:rerank-extra-params"
```