Commit 4c676e3d authored by zhuwenwen's avatar zhuwenwen
Browse files

Merge tag 'v0.9.1' into v0.9.1-dev

parents b4c4464d b6553be1
(serving-llamaindex)=
# LlamaIndex
---
title: LlamaIndex
---
[](){ #serving-llamaindex }
vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index) .
......
---
title: Offline Inference
---
[](){ #offline-inference }
You can run vLLM in your own code on a list of prompts.
The offline API is based on the [LLM][vllm.LLM] class.
To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.
For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
and runs it in vLLM using the default configuration.
```python
from vllm import LLM
llm = LLM(model="facebook/opt-125m")
```
After initializing the `LLM` instance, you can perform model inference using various APIs.
The available APIs depend on the type of model that is being run:
- [Generative models][generative-models] output logprobs which are sampled from to obtain the final output text.
- [Pooling models][pooling-models] output their hidden states directly.
Please refer to the above pages for more details about each API.
!!! info
[API Reference][offline-inference-api]
(openai-compatible-server)=
# OpenAI-Compatible Server
---
title: OpenAI-Compatible Server
---
[](){ #openai-compatible-server }
vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.
In your terminal, you can [install](../getting_started/installation.md) vLLM, then start the server with the [`vllm serve`](#vllm-serve) command. (You can also use our [Docker](#deployment-docker) image.)
In your terminal, you can [install](../getting_started/installation/README.md) vLLM, then start the server with the [`vllm serve`][serve-args] command. (You can also use our [Docker][deployment-docker] image.)
```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
--dtype auto \
--api-key token-abc123
```
To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the [official OpenAI Python client](https://github.com/openai/openai-python).
......@@ -20,56 +23,56 @@ client = OpenAI(
)
completion = client.chat.completions.create(
model="NousResearch/Meta-Llama-3-8B-Instruct",
messages=[
{"role": "user", "content": "Hello!"}
]
model="NousResearch/Meta-Llama-3-8B-Instruct",
messages=[
{"role": "user", "content": "Hello!"}
]
)
print(completion.choices[0].message)
```
:::{tip}
vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
You can pass these parameters to vLLM using the OpenAI client in the `extra_body` parameter of your requests, i.e. `extra_body={"top_k": 50}` for `top_k`.
:::
!!! tip
vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
You can pass these parameters to vLLM using the OpenAI client in the `extra_body` parameter of your requests, i.e. `extra_body={"top_k": 50}` for `top_k`.
:::{important}
By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
!!! warning
By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
To disable this behavior, please pass `--generation-config vllm` when launching the server.
:::
To disable this behavior, please pass `--generation-config vllm` when launching the server.
## Supported APIs
We currently support the following OpenAI APIs:
- [Completions API](#completions-api) (`/v1/completions`)
- Only applicable to [text generation models](../models/generative_models.md) (`--task generate`).
- *Note: `suffix` parameter is not supported.*
- [Chat Completions API](#chat-api) (`/v1/chat/completions`)
- Only applicable to [text generation models](../models/generative_models.md) (`--task generate`) with a [chat template](#chat-template).
- *Note: `parallel_tool_calls` and `user` parameters are ignored.*
- [Embeddings API](#embeddings-api) (`/v1/embeddings`)
- Only applicable to [embedding models](../models/pooling_models.md) (`--task embed`).
- [Transcriptions API](#transcriptions-api) (`/v1/audio/transcriptions`)
- Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).
- [Completions API][completions-api] (`/v1/completions`)
- Only applicable to [text generation models](../models/generative_models.md) (`--task generate`).
- *Note: `suffix` parameter is not supported.*
- [Chat Completions API][chat-api] (`/v1/chat/completions`)
- Only applicable to [text generation models](../models/generative_models.md) (`--task generate`) with a [chat template][chat-template].
- *Note: `parallel_tool_calls` and `user` parameters are ignored.*
- [Embeddings API][embeddings-api] (`/v1/embeddings`)
- Only applicable to [embedding models](../models/pooling_models.md) (`--task embed`).
- [Transcriptions API][transcriptions-api] (`/v1/audio/transcriptions`)
- Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).
In addition, we have the following custom APIs:
- [Tokenizer API](#tokenizer-api) (`/tokenize`, `/detokenize`)
- Applicable to any model with a tokenizer.
- [Pooling API](#pooling-api) (`/pooling`)
- Applicable to all [pooling models](../models/pooling_models.md).
- [Score API](#score-api) (`/score`)
- Applicable to embedding models and [cross-encoder models](../models/pooling_models.md) (`--task score`).
- [Re-rank API](#rerank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
- Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
- Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
- Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
- Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
(chat-template)=
- [Tokenizer API][tokenizer-api] (`/tokenize`, `/detokenize`)
- Applicable to any model with a tokenizer.
- [Pooling API][pooling-api] (`/pooling`)
- Applicable to all [pooling models](../models/pooling_models.md).
- [Classification API][classification-api] (`/classify`)
- Only applicable to [classification models](../models/pooling_models.md) (`--task classify`).
- [Score API][score-api] (`/score`)
- Applicable to embedding models and [cross-encoder models](../models/pooling_models.md) (`--task score`).
- [Re-rank API][rerank-api] (`/rerank`, `/v1/rerank`, `/v2/rerank`)
- Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
- Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
- Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
- Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
[](){ #chat-template }
## Chat Template
......@@ -95,10 +98,10 @@ both a `type` and a `text` field. An example is provided below:
```python
completion = client.chat.completions.create(
model="NousResearch/Meta-Llama-3-8B-Instruct",
messages=[
{"role": "user", "content": [{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"}]}
]
model="NousResearch/Meta-Llama-3-8B-Instruct",
messages=[
{"role": "user", "content": [{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"}]}
]
)
```
......@@ -109,9 +112,9 @@ request. vLLM provides best-effort support to detect this automatically, which i
the detected format, which can be one of:
- `"string"`: A string.
- Example: `"Hello world"`
- Example: `"Hello world"`
- `"openai"`: A list of dictionaries, similar to OpenAI schema.
- Example: `[{"type": "text", "text": "Hello world!"}]`
- Example: `[{"type": "text", "text": "Hello world!"}]`
If the result is not what you expect, you can set the `--chat-template-content-format` CLI argument
to override which format to use.
......@@ -124,13 +127,13 @@ Or directly merge them into the JSON payload if you are using HTTP call directly
```python
completion = client.chat.completions.create(
model="NousResearch/Meta-Llama-3-8B-Instruct",
messages=[
{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
],
extra_body={
"guided_choice": ["positive", "negative"]
}
model="NousResearch/Meta-Llama-3-8B-Instruct",
messages=[
{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
],
extra_body={
"guided_choice": ["positive", "negative"]
}
)
```
......@@ -146,77 +149,29 @@ with `--enable-request-id-headers`.
```python
completion = client.chat.completions.create(
model="NousResearch/Meta-Llama-3-8B-Instruct",
messages=[
{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
],
extra_headers={
"x-request-id": "sentiment-classification-00001",
}
model="NousResearch/Meta-Llama-3-8B-Instruct",
messages=[
{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
],
extra_headers={
"x-request-id": "sentiment-classification-00001",
}
)
print(completion._request_id)
completion = client.completions.create(
model="NousResearch/Meta-Llama-3-8B-Instruct",
prompt="A robot may not injure a human being",
extra_headers={
"x-request-id": "completion-test",
}
model="NousResearch/Meta-Llama-3-8B-Instruct",
prompt="A robot may not injure a human being",
extra_headers={
"x-request-id": "completion-test",
}
)
print(completion._request_id)
```
## CLI Reference
(vllm-serve)=
### `vllm serve`
The `vllm serve` command is used to launch the OpenAI-compatible server.
:::{tip}
The vast majority of command-line arguments are based on those for offline inference.
See [here](configuration-options) for some common options.
:::
:::{argparse}
:module: vllm.entrypoints.openai.cli_args
:func: create_parser_for_docs
:prog: vllm serve
:::
#### Configuration file
You can load CLI arguments via a [YAML](https://yaml.org/) config file.
The argument names must be the long form of those outlined [above](#vllm-serve).
For example:
```yaml
# config.yaml
model: meta-llama/Llama-3.1-8B-Instruct
host: "127.0.0.1"
port: 6379
uvicorn-log-level: "info"
```
To use the above config file:
```bash
vllm serve --config config.yaml
```
:::{note}
In case an argument is supplied simultaneously using command line and the config file, the value from the command line will take precedence.
The order of priorities is `command line > config file values > defaults`.
e.g. `vllm serve SOME_MODEL --config config.yaml`, SOME_MODEL takes precedence over `model` in config file.
:::
## API Reference
(completions-api)=
[](){ #completions-api }
### Completions API
......@@ -227,23 +182,19 @@ Code example: <gh-file:examples/online_serving/openai_completion_client.py>
#### Extra parameters
The following [sampling parameters](#sampling-params) are supported.
The following [sampling parameters][sampling-params] are supported.
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-completion-sampling-params
:end-before: end-completion-sampling-params
:::
```python
--8<-- "vllm/entrypoints/openai/protocol.py:completion-sampling-params"
```
The following extra parameters are supported:
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-completion-extra-params
:end-before: end-completion-extra-params
:::
```python
--8<-- "vllm/entrypoints/openai/protocol.py:completion-extra-params"
```
(chat-api)=
[](){ #chat-api }
### Chat API
......@@ -252,37 +203,33 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
[Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
see our [Multimodal Inputs](#multimodal-inputs) guide for more information.
see our [Multimodal Inputs][multimodal-inputs] guide for more information.
- *Note: `image_url.detail` parameter is not supported.*
Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>
#### Extra parameters
The following [sampling parameters](#sampling-params) are supported.
The following [sampling parameters][sampling-params] are supported.
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-chat-completion-sampling-params
:end-before: end-chat-completion-sampling-params
:::
```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-sampling-params"
```
The following extra parameters are supported:
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-chat-completion-extra-params
:end-before: end-chat-completion-extra-params
:::
```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-extra-params"
```
(embeddings-api)=
[](){ #embeddings-api }
### Embeddings API
Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api))
If the model has a [chat template][chat-template], you can replace `inputs` with a list of `messages` (same schema as [Chat API][chat-api])
which will be treated as a single prompt to the model.
Code example: <gh-file:examples/online_serving/openai_embedding_client.py>
......@@ -292,138 +239,121 @@ Code example: <gh-file:examples/online_serving/openai_embedding_client.py>
You can pass multi-modal inputs to embedding models by defining a custom chat template for the server
and passing a list of `messages` in the request. Refer to the examples below for illustration.
:::::{tab-set}
::::{tab-item} VLM2Vec
To serve the model:
=== "VLM2Vec"
```bash
vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
--trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
```
To serve the model:
:::{important}
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
to run this model in embedding mode instead of text generation mode.
```bash
vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
--trust-remote-code \
--max-model-len 4096 \
--chat-template examples/template_vlm2vec.jinja
```
The custom chat template is completely different from the original one for this model,
and can be found here: <gh-file:examples/template_vlm2vec.jinja>
:::
!!! warning
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
to run this model in embedding mode instead of text generation mode.
Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
The custom chat template is completely different from the original one for this model,
and can be found here: <gh-file:examples/template_vlm2vec.jinja>
```python
import requests
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
response = requests.post(
"http://localhost:8000/v1/embeddings",
json={
"model": "TIGER-Lab/VLM2Vec-Full",
"messages": [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": image_url}},
{"type": "text", "text": "Represent the given image."},
],
}],
"encoding_format": "float",
},
)
response.raise_for_status()
response_json = response.json()
print("Embedding output:", response_json["data"][0]["embedding"])
```
Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
::::
```python
import requests
::::{tab-item} DSE-Qwen2-MRL
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
To serve the model:
response = requests.post(
"http://localhost:8000/v1/embeddings",
json={
"model": "TIGER-Lab/VLM2Vec-Full",
"messages": [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": image_url}},
{"type": "text", "text": "Represent the given image."},
],
}],
"encoding_format": "float",
},
)
response.raise_for_status()
response_json = response.json()
print("Embedding output:", response_json["data"][0]["embedding"])
```
```bash
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
--trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
```
=== "DSE-Qwen2-MRL"
:::{important}
Like with VLM2Vec, we have to explicitly pass `--task embed`.
To serve the model:
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
:::
```bash
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
--trust-remote-code \
--max-model-len 8192 \
--chat-template examples/template_dse_qwen2_vl.jinja
```
:::{important}
`MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
example below for details.
:::
!!! warning
Like with VLM2Vec, we have to explicitly pass `--task embed`.
::::
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
:::::
!!! warning
`MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
example below for details.
Full example: <gh-file:examples/online_serving/openai_chat_embedding_client_for_multimodal.py>
#### Extra parameters
The following [pooling parameters](#pooling-params) are supported.
The following [pooling parameters][pooling-params] are supported.
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-embedding-pooling-params
:end-before: end-embedding-pooling-params
:::
```python
--8<-- "vllm/entrypoints/openai/protocol.py:embedding-pooling-params"
```
The following extra parameters are supported by default:
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-embedding-extra-params
:end-before: end-embedding-extra-params
:::
```python
--8<-- "vllm/entrypoints/openai/protocol.py:embedding-extra-params"
```
For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead:
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-chat-embedding-extra-params
:end-before: end-chat-embedding-extra-params
:::
```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-embedding-extra-params"
```
(transcriptions-api)=
[](){ #transcriptions-api }
### Transcriptions API
Our Transcriptions API is compatible with [OpenAI's Transcriptions API](https://platform.openai.com/docs/api-reference/audio/createTranscription);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
:::{note}
To use the Transcriptions API, please install with extra audio dependencies using `pip install vllm[audio]`.
:::
!!! note
To use the Transcriptions API, please install with extra audio dependencies using `pip install vllm[audio]`.
Code example: <gh-file:examples/online_serving/openai_transcription_client.py>
<!-- TODO: api enforced limits + uploading audios -->
#### Extra Parameters
The following [sampling parameters](#sampling-params) are supported.
The following [sampling parameters][sampling-params] are supported.
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-transcription-sampling-params
:end-before: end-transcription-sampling-params
:::
```python
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-sampling-params"
```
The following extra parameters are supported:
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-transcription-extra-params
:end-before: end-transcription-extra-params
:::
```python
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-extra-params"
```
(tokenizer-api)=
[](){ #tokenizer-api }
### Tokenizer API
......@@ -433,17 +363,137 @@ It consists of two endpoints:
- `/tokenize` corresponds to calling `tokenizer.encode()`.
- `/detokenize` corresponds to calling `tokenizer.decode()`.
(pooling-api)=
[](){ #pooling-api }
### Pooling API
Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states.
The input format is the same as [Embeddings API](#embeddings-api), but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
The input format is the same as [Embeddings API][embeddings-api], but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
Code example: <gh-file:examples/online_serving/openai_pooling_client.py>
(score-api)=
[](){ #classification-api }
### Classification API
Our Classification API directly supports Hugging Face sequence-classification models such as [ai21labs/Jamba-tiny-reward-dev](https://huggingface.co/ai21labs/Jamba-tiny-reward-dev) and [jason9693/Qwen2.5-1.5B-apeach](https://huggingface.co/jason9693/Qwen2.5-1.5B-apeach).
We automatically wrap any other transformer via `as_classification_model()`, which pools on the last token, attaches a `RowParallelLinear` head, and applies a softmax to produce per-class probabilities.
Code example: <gh-file:examples/online_serving/openai_classification_client.py>
#### Example Requests
You can classify multiple texts by passing an array of strings:
Request:
```bash
curl -v "http://127.0.0.1:8000/classify" \
-H "Content-Type: application/json" \
-d '{
"model": "jason9693/Qwen2.5-1.5B-apeach",
"input": [
"Loved the new café—coffee was great.",
"This update broke everything. Frustrating."
]
}'
```
Response:
```bash
{
"id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
"object": "list",
"created": 1745383065,
"model": "jason9693/Qwen2.5-1.5B-apeach",
"data": [
{
"index": 0,
"label": "Default",
"probs": [
0.565970778465271,
0.4340292513370514
],
"num_classes": 2
},
{
"index": 1,
"label": "Spoiled",
"probs": [
0.26448777318000793,
0.7355121970176697
],
"num_classes": 2
}
],
"usage": {
"prompt_tokens": 20,
"total_tokens": 20,
"completion_tokens": 0,
"prompt_tokens_details": null
}
}
```
You can also pass a string directly to the `input` field:
Request:
```bash
curl -v "http://127.0.0.1:8000/classify" \
-H "Content-Type: application/json" \
-d '{
"model": "jason9693/Qwen2.5-1.5B-apeach",
"input": "Loved the new café—coffee was great."
}'
```
Response:
```bash
{
"id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
"object": "list",
"created": 1745383213,
"model": "jason9693/Qwen2.5-1.5B-apeach",
"data": [
{
"index": 0,
"label": "Default",
"probs": [
0.565970778465271,
0.4340292513370514
],
"num_classes": 2
}
],
"usage": {
"prompt_tokens": 10,
"total_tokens": 10,
"completion_tokens": 0,
"prompt_tokens_details": null
}
}
```
#### Extra parameters
The following [pooling parameters][pooling-params] are supported.
```python
--8<-- "vllm/entrypoints/openai/protocol.py:classification-pooling-params"
```
The following extra parameters are supported:
```python
--8<-- "vllm/entrypoints/openai/protocol.py:classification-extra-params"
```
[](){ #score-api }
### Score API
......@@ -590,23 +640,19 @@ Response:
#### Extra parameters
The following [pooling parameters](#pooling-params) are supported.
The following [pooling parameters][pooling-params] are supported.
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-score-pooling-params
:end-before: end-score-pooling-params
:::
```python
--8<-- "vllm/entrypoints/openai/protocol.py:score-pooling-params"
```
The following extra parameters are supported:
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-score-extra-params
:end-before: end-score-extra-params
:::
```python
--8<-- "vllm/entrypoints/openai/protocol.py:score-extra-params"
```
(rerank-api)=
[](){ #rerank-api }
### Re-rank API
......@@ -677,18 +723,14 @@ Response:
#### Extra parameters
The following [pooling parameters](#pooling-params) are supported.
The following [pooling parameters][pooling-params] are supported.
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-rerank-pooling-params
:end-before: end-rerank-pooling-params
:::
```python
--8<-- "vllm/entrypoints/openai/protocol.py:rerank-pooling-params"
```
The following extra parameters are supported:
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-rerank-extra-params
:end-before: end-rerank-extra-params
:::
```python
--8<-- "vllm/entrypoints/openai/protocol.py:rerank-extra-params"
```
.vertical-table-header th.head:not(.stub) {
writing-mode: sideways-lr;
white-space: nowrap;
max-width: 0;
p {
margin: 0;
}
}
<style>
.notification-bar {
width: 100vw;
display: flex;
justify-content: center;
align-items: center;
font-size: 16px;
padding: 0 6px 0 6px;
}
.notification-bar p {
margin: 0;
}
.notification-bar a {
font-weight: bold;
text-decoration: none;
}
/* Light mode styles (default) */
.notification-bar {
background-color: #fff3cd;
color: #856404;
}
.notification-bar a {
color: #d97706;
}
/* Dark mode styles */
html[data-theme=dark] .notification-bar {
background-color: #333;
color: #ddd;
}
html[data-theme=dark] .notification-bar a {
color: #ffa500; /* Brighter color for visibility */
}
</style>
<div class="notification-bar">
<p>You are viewing the latest developer preview docs. <a href="https://docs.vllm.ai/en/stable/">Click here</a> to view docs for the latest stable release.</p>
</div>
# AsyncLLMEngine
```{eval-rst}
.. autoclass:: vllm.AsyncLLMEngine
:members:
:show-inheritance:
```
# vLLM Engine
```{eval-rst}
.. automodule:: vllm.engine
```
```{eval-rst}
.. currentmodule:: vllm.engine
```
:::{toctree}
:caption: Engines
:maxdepth: 2
llm_engine
async_llm_engine
:::
# LLMEngine
```{eval-rst}
.. autoclass:: vllm.LLMEngine
:members:
:show-inheritance:
```
# Inference Parameters
Inference parameters for vLLM APIs.
(sampling-params)=
## Sampling Parameters
```{eval-rst}
.. autoclass:: vllm.SamplingParams
:members:
```
(pooling-params)=
## Pooling Parameters
```{eval-rst}
.. autoclass:: vllm.PoolingParams
:members:
```
# Model Adapters
## Module Contents
```{eval-rst}
.. automodule:: vllm.model_executor.models.adapters
:members:
:member-order: bysource
```
# Model Development
## Submodules
:::{toctree}
:maxdepth: 1
interfaces_base
interfaces
adapters
:::
# Optional Interfaces
## Module Contents
```{eval-rst}
.. automodule:: vllm.model_executor.models.interfaces
:members:
:member-order: bysource
```
# Base Model Interfaces
## Module Contents
```{eval-rst}
.. automodule:: vllm.model_executor.models.interfaces_base
:members:
:member-order: bysource
```
(multi-modality)=
# Multi-Modality
vLLM provides experimental support for multi-modal models through the {mod}`vllm.multimodal` package.
Multi-modal inputs can be passed alongside text and token prompts to [supported models](#supported-mm-models)
via the `multi_modal_data` field in {class}`vllm.inputs.PromptType`.
Looking to add your own multi-modal model? Please follow the instructions listed [here](#supports-multimodal).
## Module Contents
```{eval-rst}
.. autodata:: vllm.multimodal.MULTIMODAL_REGISTRY
```
## Submodules
:::{toctree}
:maxdepth: 1
inputs
parse
processing
profiling
registry
:::
# Input Definitions
## User-facing inputs
```{eval-rst}
.. autodata:: vllm.multimodal.inputs.MultiModalDataDict
```
## Internal data structures
```{eval-rst}
.. autoclass:: vllm.multimodal.inputs.PlaceholderRange
:members:
:show-inheritance:
```
```{eval-rst}
.. autodata:: vllm.multimodal.inputs.NestedTensors
```
```{eval-rst}
.. autoclass:: vllm.multimodal.inputs.MultiModalFieldElem
:members:
:show-inheritance:
```
```{eval-rst}
.. autoclass:: vllm.multimodal.inputs.MultiModalFieldConfig
:members:
:show-inheritance:
```
```{eval-rst}
.. autoclass:: vllm.multimodal.inputs.MultiModalKwargsItem
:members:
:show-inheritance:
```
```{eval-rst}
.. autoclass:: vllm.multimodal.inputs.MultiModalKwargs
:members:
:show-inheritance:
```
```{eval-rst}
.. autoclass:: vllm.multimodal.inputs.MultiModalInputs
:members:
:show-inheritance:
```
# Data Parsing
## Module Contents
```{eval-rst}
.. automodule:: vllm.multimodal.parse
:members:
:member-order: bysource
```
# Data Processing
## Module Contents
```{eval-rst}
.. automodule:: vllm.multimodal.processing
:members:
:member-order: bysource
```
# Memory Profiling
## Module Contents
```{eval-rst}
.. automodule:: vllm.multimodal.profiling
:members:
:member-order: bysource
```
# Registry
## Module Contents
```{eval-rst}
.. automodule:: vllm.multimodal.registry
:members:
:member-order: bysource
```
# Offline Inference
:::{toctree}
:caption: Contents
:maxdepth: 1
llm
llm_inputs
:::
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment