Commit 4c676e3d authored by zhuwenwen

Merge tag 'v0.9.1' into v0.9.1-dev

parents b4c4464d b6553be1
---
title: LlamaIndex
---
[](){ #serving-llamaindex }

vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index).
......
---
title: Offline Inference
---
[](){ #offline-inference }
You can run vLLM in your own code on a list of prompts.
The offline API is based on the [LLM][vllm.LLM] class.
To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.
For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from Hugging Face
and runs it in vLLM using the default configuration.
```python
from vllm import LLM
llm = LLM(model="facebook/opt-125m")
```
After initializing the `LLM` instance, you can perform model inference using various APIs.
The available APIs depend on the type of model that is being run:
- [Generative models][generative-models] output logprobs, which are then sampled to obtain the final output text.
- [Pooling models][pooling-models] output their hidden states directly.
Please refer to the above pages for more details about each API.
!!! info
    [API Reference][offline-inference-api]
---
title: OpenAI-Compatible Server
---
[](){ #openai-compatible-server }

vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.

In your terminal, you can [install](../getting_started/installation/README.md) vLLM, then start the server with the [`vllm serve`][serve-args] command. (You can also use our [Docker][deployment-docker] image.)

```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
    --dtype auto \
    --api-key token-abc123
```

To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the [official OpenAI Python client](https://github.com/openai/openai-python).
...@@ -20,56 +23,56 @@ client = OpenAI(
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(completion.choices[0].message)
```

!!! tip
    vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
    You can pass these parameters to vLLM using the OpenAI client in the `extra_body` parameter of your requests, i.e. `extra_body={"top_k": 50}` for `top_k`.

!!! warning
    By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.

    To disable this behavior, please pass `--generation-config vllm` when launching the server.
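As a rough illustration of the `extra_body` tip above, the sketch below shows where such extra parameters end up when a request is sent as raw JSON: at the top level of the request body, next to `model` and `messages`. The helper name `build_chat_payload` is hypothetical and is not part of vLLM or the OpenAI client; it only builds the dictionary and sends nothing.

```python
import json


def build_chat_payload(model, messages, extra_body=None):
    """Build the JSON body of a chat completion request (hypothetical helper)."""
    payload = {"model": model, "messages": messages}
    # Extra, non-OpenAI parameters such as `top_k` are merged into the
    # top level of the body rather than nested under an "extra_body" key.
    payload.update(extra_body or {})
    return payload


payload = build_chat_payload(
    "NousResearch/Meta-Llama-3-8B-Instruct",
    [{"role": "user", "content": "Hello!"}],
    extra_body={"top_k": 50},
)
print(json.dumps(payload, indent=2))
```

With the OpenAI Python client you would pass `extra_body={"top_k": 50}` to `client.chat.completions.create(...)` instead of building the body by hand.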
## Supported APIs

We currently support the following OpenAI APIs:

- [Completions API][completions-api] (`/v1/completions`)
    - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`).
    - *Note: `suffix` parameter is not supported.*
- [Chat Completions API][chat-api] (`/v1/chat/completions`)
    - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`) with a [chat template][chat-template].
    - *Note: `parallel_tool_calls` and `user` parameters are ignored.*
- [Embeddings API][embeddings-api] (`/v1/embeddings`)
    - Only applicable to [embedding models](../models/pooling_models.md) (`--task embed`).
- [Transcriptions API][transcriptions-api] (`/v1/audio/transcriptions`)
    - Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).

In addition, we have the following custom APIs:

- [Tokenizer API][tokenizer-api] (`/tokenize`, `/detokenize`)
    - Applicable to any model with a tokenizer.
- [Pooling API][pooling-api] (`/pooling`)
    - Applicable to all [pooling models](../models/pooling_models.md).
- [Classification API][classification-api] (`/classify`)
    - Only applicable to [classification models](../models/pooling_models.md) (`--task classify`).
- [Score API][score-api] (`/score`)
    - Applicable to embedding models and [cross-encoder models](../models/pooling_models.md) (`--task score`).
- [Re-rank API][rerank-api] (`/rerank`, `/v1/rerank`, `/v2/rerank`)
    - Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
    - Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
    - Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
    - Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).

[](){ #chat-template }

## Chat Template
...@@ -95,10 +98,10 @@ both a `type` and a `text` field. An example is provided below:

```python
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": [{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"}]}
    ]
)
```

...@@ -109,9 +112,9 @@ request. vLLM provides best-effort support to detect this automatically, which i

the detected format, which can be one of:

- `"string"`: A string.
    - Example: `"Hello world"`
- `"openai"`: A list of dictionaries, similar to OpenAI schema.
    - Example: `[{"type": "text", "text": "Hello world!"}]`

If the result is not what you expect, you can set the `--chat-template-content-format` CLI argument
to override which format to use.
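To make the two content shapes concrete, here is a small client-side sketch that tells them apart and converts between them. It is not vLLM's detection logic; the function names are hypothetical and it only inspects the `content` value of a single message.

```python
def detect_content_format(content):
    """Classify message content as "string" or "openai" (illustrative only)."""
    if isinstance(content, str):
        return "string"
    if isinstance(content, list) and all(
        isinstance(part, dict) and "type" in part for part in content
    ):
        return "openai"
    raise ValueError("unrecognized content shape")


def to_openai_format(content):
    """Wrap a plain string in a single text part; pass lists through."""
    if detect_content_format(content) == "string":
        return [{"type": "text", "text": content}]
    return content


def to_string_format(content):
    """Join the text parts of an "openai"-style list into one string."""
    if detect_content_format(content) == "string":
        return content
    return "".join(part["text"] for part in content if part["type"] == "text")


print(detect_content_format("Hello world"))
print(to_openai_format("Hello world!"))
print(to_string_format([{"type": "text", "text": "Hello world!"}]))
```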
...@@ -124,13 +127,13 @@ Or directly merge them into the JSON payload if you are using HTTP call directly

```python
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_body={
        "guided_choice": ["positive", "negative"]
    }
)
```

...@@ -146,77 +149,29 @@ with `--enable-request-id-headers`.

```python
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_headers={
        "x-request-id": "sentiment-classification-00001",
    }
)
print(completion._request_id)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    extra_headers={
        "x-request-id": "completion-test",
    }
)
print(completion._request_id)
```
## CLI Reference
(vllm-serve)=
### `vllm serve`
The `vllm serve` command is used to launch the OpenAI-compatible server.
:::{tip}
The vast majority of command-line arguments are based on those for offline inference.
See [here](configuration-options) for some common options.
:::
:::{argparse}
:module: vllm.entrypoints.openai.cli_args
:func: create_parser_for_docs
:prog: vllm serve
:::
#### Configuration file
You can load CLI arguments via a [YAML](https://yaml.org/) config file.
The argument names must be the long form of those outlined [above](#vllm-serve).
For example:
```yaml
# config.yaml
model: meta-llama/Llama-3.1-8B-Instruct
host: "127.0.0.1"
port: 6379
uvicorn-log-level: "info"
```
To use the above config file:
```bash
vllm serve --config config.yaml
```
:::{note}
In case an argument is supplied simultaneously using command line and the config file, the value from the command line will take precedence.
The order of priorities is `command line > config file values > defaults`.
e.g. `vllm serve SOME_MODEL --config config.yaml`, SOME_MODEL takes precedence over `model` in config file.
:::
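The precedence rule `command line > config file values > defaults` can be pictured as a layered lookup. The sketch below is only an illustration of that ordering using the standard library; `resolve_args` is a hypothetical name, the default values are made up for the example, and this is not how `vllm serve` parses its arguments.

```python
from collections import ChainMap


def resolve_args(cli_args, config_file_args, defaults):
    """Return the effective settings; earlier mappings win over later ones."""
    return dict(ChainMap(cli_args, config_file_args, defaults))


# Illustrative defaults (assumed values, not vLLM's actual defaults).
defaults = {"host": "0.0.0.0", "port": 8000, "uvicorn-log-level": "info"}
# Values as they would be read from config.yaml above.
config_file_args = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "host": "127.0.0.1",
    "port": 6379,
}
# `vllm serve SOME_MODEL --config config.yaml` supplies the model on the command line.
cli_args = {"model": "SOME_MODEL"}

resolved = resolve_args(cli_args, config_file_args, defaults)
print(resolved)
```

Here `model` comes from the command line, `host` and `port` from the config file, and `uvicorn-log-level` falls back to the default.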
## API Reference

[](){ #completions-api }

### Completions API

...@@ -227,23 +182,19 @@ Code example: <gh-file:examples/online_serving/openai_completion_client.py>

#### Extra parameters

The following [sampling parameters][sampling-params] are supported.

```python
--8<-- "vllm/entrypoints/openai/protocol.py:completion-sampling-params"
```

The following extra parameters are supported:

```python
--8<-- "vllm/entrypoints/openai/protocol.py:completion-extra-params"
```

[](){ #chat-api }

### Chat API

...@@ -252,37 +203,33 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai

We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
[Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
see our [Multimodal Inputs][multimodal-inputs] guide for more information.

- *Note: `image_url.detail` parameter is not supported.*

Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>

#### Extra parameters

The following [sampling parameters][sampling-params] are supported.

```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-sampling-params"
```

The following extra parameters are supported:

```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-extra-params"
```

[](){ #embeddings-api }

### Embeddings API

Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.

If the model has a [chat template][chat-template], you can replace `inputs` with a list of `messages` (same schema as [Chat API][chat-api])
which will be treated as a single prompt to the model.

Code example: <gh-file:examples/online_serving/openai_embedding_client.py>
...@@ -292,138 +239,121 @@ Code example: <gh-file:examples/online_serving/openai_embedding_client.py>

You can pass multi-modal inputs to embedding models by defining a custom chat template for the server
and passing a list of `messages` in the request. Refer to the examples below for illustration.

=== "VLM2Vec"

    To serve the model:

    ```bash
    vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
      --trust-remote-code \
      --max-model-len 4096 \
      --chat-template examples/template_vlm2vec.jinja
    ```

    !!! warning
        Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
        to run this model in embedding mode instead of text generation mode.

        The custom chat template is completely different from the original one for this model,
        and can be found here: <gh-file:examples/template_vlm2vec.jinja>

    Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:

    ```python
    import requests

    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

    response = requests.post(
        "http://localhost:8000/v1/embeddings",
        json={
            "model": "TIGER-Lab/VLM2Vec-Full",
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": "Represent the given image."},
                ],
            }],
            "encoding_format": "float",
        },
    )
    response.raise_for_status()
    response_json = response.json()
    print("Embedding output:", response_json["data"][0]["embedding"])
    ```

=== "DSE-Qwen2-MRL"

    To serve the model:

    ```bash
    vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
      --trust-remote-code \
      --max-model-len 8192 \
      --chat-template examples/template_dse_qwen2_vl.jinja
    ```

    !!! warning
        Like with VLM2Vec, we have to explicitly pass `--task embed`.

        Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
        by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>

    !!! warning
        `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
        example below for details.

Full example: <gh-file:examples/online_serving/openai_chat_embedding_client_for_multimodal.py>
#### Extra parameters

The following [pooling parameters][pooling-params] are supported.

```python
--8<-- "vllm/entrypoints/openai/protocol.py:embedding-pooling-params"
```

The following extra parameters are supported by default:

```python
--8<-- "vllm/entrypoints/openai/protocol.py:embedding-extra-params"
```

For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead:

```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-embedding-extra-params"
```

[](){ #transcriptions-api }

### Transcriptions API

Our Transcriptions API is compatible with [OpenAI's Transcriptions API](https://platform.openai.com/docs/api-reference/audio/createTranscription);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.

!!! note
    To use the Transcriptions API, please install with extra audio dependencies using `pip install vllm[audio]`.

Code example: <gh-file:examples/online_serving/openai_transcription_client.py>

<!-- TODO: api enforced limits + uploading audios -->

#### Extra Parameters

The following [sampling parameters][sampling-params] are supported.

```python
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-sampling-params"
```

The following extra parameters are supported:

```python
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-extra-params"
```

[](){ #tokenizer-api }

### Tokenizer API

...@@ -433,17 +363,137 @@ It consists of two endpoints:

- `/tokenize` corresponds to calling `tokenizer.encode()`.
- `/detokenize` corresponds to calling `tokenizer.decode()`.

[](){ #pooling-api }

### Pooling API

Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states.

The input format is the same as [Embeddings API][embeddings-api], but the output data can contain an arbitrary nested list, not just a 1-D list of floats.

Code example: <gh-file:examples/online_serving/openai_pooling_client.py>

[](){ #classification-api }
### Classification API
Our Classification API directly supports Hugging Face sequence-classification models such as [ai21labs/Jamba-tiny-reward-dev](https://huggingface.co/ai21labs/Jamba-tiny-reward-dev) and [jason9693/Qwen2.5-1.5B-apeach](https://huggingface.co/jason9693/Qwen2.5-1.5B-apeach).
We automatically wrap any other transformer via `as_classification_model()`, which pools on the last token, attaches a `RowParallelLinear` head, and applies a softmax to produce per-class probabilities.
Code example: <gh-file:examples/online_serving/openai_classification_client.py>
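The final softmax step mentioned above turns the classification head's raw scores into per-class probabilities that sum to 1. The following is a minimal pure-Python sketch of that step alone, with made-up logits; it is not vLLM's implementation and leaves out pooling and the linear head.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical two-class logits for a single input.
probs = softmax([0.5, -0.5])
print(probs)
```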
#### Example Requests
You can classify multiple texts by passing an array of strings:
Request:
```bash
curl -v "http://127.0.0.1:8000/classify" \
-H "Content-Type: application/json" \
-d '{
"model": "jason9693/Qwen2.5-1.5B-apeach",
"input": [
"Loved the new café—coffee was great.",
"This update broke everything. Frustrating."
]
}'
```
Response:
```json
{
"id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
"object": "list",
"created": 1745383065,
"model": "jason9693/Qwen2.5-1.5B-apeach",
"data": [
{
"index": 0,
"label": "Default",
"probs": [
0.565970778465271,
0.4340292513370514
],
"num_classes": 2
},
{
"index": 1,
"label": "Spoiled",
"probs": [
0.26448777318000793,
0.7355121970176697
],
"num_classes": 2
}
],
"usage": {
"prompt_tokens": 20,
"total_tokens": 20,
"completion_tokens": 0,
"prompt_tokens_details": null
}
}
```
You can also pass a string directly to the `input` field:
Request:
```bash
curl -v "http://127.0.0.1:8000/classify" \
-H "Content-Type: application/json" \
-d '{
"model": "jason9693/Qwen2.5-1.5B-apeach",
"input": "Loved the new café—coffee was great."
}'
```
Response:
```json
{
"id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
"object": "list",
"created": 1745383213,
"model": "jason9693/Qwen2.5-1.5B-apeach",
"data": [
{
"index": 0,
"label": "Default",
"probs": [
0.565970778465271,
0.4340292513370514
],
"num_classes": 2
}
],
"usage": {
"prompt_tokens": 10,
"total_tokens": 10,
"completion_tokens": 0,
"prompt_tokens_details": null
}
}
```
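On the client side, a response of this shape can be reduced to one row per input. The sketch below parses a trimmed copy of the first sample response shown above and picks the highest-probability class for each item; the `summarize` helper is hypothetical, and it assumes only the `data`, `index`, `label`, `probs`, and `num_classes` fields seen in the samples.

```python
import json

# Trimmed copy of the sample response above (only the fields used here).
response_text = """
{
  "model": "jason9693/Qwen2.5-1.5B-apeach",
  "data": [
    {"index": 0, "label": "Default",
     "probs": [0.565970778465271, 0.4340292513370514], "num_classes": 2},
    {"index": 1, "label": "Spoiled",
     "probs": [0.26448777318000793, 0.7355121970176697], "num_classes": 2}
  ]
}
"""


def summarize(response):
    """Return (index, label, top class index, top probability) per input."""
    rows = []
    for item in response["data"]:
        best = max(range(item["num_classes"]), key=lambda i: item["probs"][i])
        rows.append((item["index"], item["label"], best, item["probs"][best]))
    return rows


rows = summarize(json.loads(response_text))
for index, label, best, prob in rows:
    print(f"input {index}: label={label!r}, top class index={best}, p={prob:.3f}")
```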
#### Extra parameters
The following [pooling parameters][pooling-params] are supported.
```python
--8<-- "vllm/entrypoints/openai/protocol.py:classification-pooling-params"
```
The following extra parameters are supported:
```python
--8<-- "vllm/entrypoints/openai/protocol.py:classification-extra-params"
```
[](){ #score-api }
### Score API

...@@ -590,23 +640,19 @@ Response:

#### Extra parameters

The following [pooling parameters][pooling-params] are supported.

```python
--8<-- "vllm/entrypoints/openai/protocol.py:score-pooling-params"
```

The following extra parameters are supported:

```python
--8<-- "vllm/entrypoints/openai/protocol.py:score-extra-params"
```

[](){ #rerank-api }

### Re-rank API

...@@ -677,18 +723,14 @@ Response:

#### Extra parameters

The following [pooling parameters][pooling-params] are supported.

```python
--8<-- "vllm/entrypoints/openai/protocol.py:rerank-pooling-params"
```

The following extra parameters are supported:

```python
--8<-- "vllm/entrypoints/openai/protocol.py:rerank-extra-params"
```
.vertical-table-header th.head:not(.stub) {
writing-mode: sideways-lr;
white-space: nowrap;
max-width: 0;
p {
margin: 0;
}
}
<style>
.notification-bar {
width: 100vw;
display: flex;
justify-content: center;
align-items: center;
font-size: 16px;
padding: 0 6px 0 6px;
}
.notification-bar p {
margin: 0;
}
.notification-bar a {
font-weight: bold;
text-decoration: none;
}
/* Light mode styles (default) */
.notification-bar {
background-color: #fff3cd;
color: #856404;
}
.notification-bar a {
color: #d97706;
}
/* Dark mode styles */
html[data-theme=dark] .notification-bar {
background-color: #333;
color: #ddd;
}
html[data-theme=dark] .notification-bar a {
color: #ffa500; /* Brighter color for visibility */
}
</style>
<div class="notification-bar">
<p>You are viewing the latest developer preview docs. <a href="https://docs.vllm.ai/en/stable/">Click here</a> to view docs for the latest stable release.</p>
</div>
# AsyncLLMEngine
```{eval-rst}
.. autoclass:: vllm.AsyncLLMEngine
:members:
:show-inheritance:
```
# vLLM Engine
```{eval-rst}
.. automodule:: vllm.engine
```
```{eval-rst}
.. currentmodule:: vllm.engine
```
:::{toctree}
:caption: Engines
:maxdepth: 2
llm_engine
async_llm_engine
:::
# LLMEngine
```{eval-rst}
.. autoclass:: vllm.LLMEngine
:members:
:show-inheritance:
```
# Inference Parameters
Inference parameters for vLLM APIs.
(sampling-params)=
## Sampling Parameters
```{eval-rst}
.. autoclass:: vllm.SamplingParams
:members:
```
(pooling-params)=
## Pooling Parameters
```{eval-rst}
.. autoclass:: vllm.PoolingParams
:members:
```
# Model Adapters
## Module Contents
```{eval-rst}
.. automodule:: vllm.model_executor.models.adapters
:members:
:member-order: bysource
```
# Model Development
## Submodules
:::{toctree}
:maxdepth: 1
interfaces_base
interfaces
adapters
:::
# Optional Interfaces
## Module Contents
```{eval-rst}
.. automodule:: vllm.model_executor.models.interfaces
:members:
:member-order: bysource
```
# Base Model Interfaces
## Module Contents
```{eval-rst}
.. automodule:: vllm.model_executor.models.interfaces_base
:members:
:member-order: bysource
```
(multi-modality)=
# Multi-Modality
vLLM provides experimental support for multi-modal models through the {mod}`vllm.multimodal` package.
Multi-modal inputs can be passed alongside text and token prompts to [supported models](#supported-mm-models)
via the `multi_modal_data` field in {class}`vllm.inputs.PromptType`.
Looking to add your own multi-modal model? Please follow the instructions listed [here](#supports-multimodal).
## Module Contents
```{eval-rst}
.. autodata:: vllm.multimodal.MULTIMODAL_REGISTRY
```
## Submodules
:::{toctree}
:maxdepth: 1
inputs
parse
processing
profiling
registry
:::
# Input Definitions
## User-facing inputs
```{eval-rst}
.. autodata:: vllm.multimodal.inputs.MultiModalDataDict
```
## Internal data structures
```{eval-rst}
.. autoclass:: vllm.multimodal.inputs.PlaceholderRange
:members:
:show-inheritance:
```
```{eval-rst}
.. autodata:: vllm.multimodal.inputs.NestedTensors
```
```{eval-rst}
.. autoclass:: vllm.multimodal.inputs.MultiModalFieldElem
:members:
:show-inheritance:
```
```{eval-rst}
.. autoclass:: vllm.multimodal.inputs.MultiModalFieldConfig
:members:
:show-inheritance:
```
```{eval-rst}
.. autoclass:: vllm.multimodal.inputs.MultiModalKwargsItem
:members:
:show-inheritance:
```
```{eval-rst}
.. autoclass:: vllm.multimodal.inputs.MultiModalKwargs
:members:
:show-inheritance:
```
```{eval-rst}
.. autoclass:: vllm.multimodal.inputs.MultiModalInputs
:members:
:show-inheritance:
```
# Data Parsing
## Module Contents
```{eval-rst}
.. automodule:: vllm.multimodal.parse
:members:
:member-order: bysource
```
# Data Processing
## Module Contents
```{eval-rst}
.. automodule:: vllm.multimodal.processing
:members:
:member-order: bysource
```
# Memory Profiling
## Module Contents
```{eval-rst}
.. automodule:: vllm.multimodal.profiling
:members:
:member-order: bysource
```
# Registry
## Module Contents
```{eval-rst}
.. automodule:: vllm.multimodal.registry
:members:
:member-order: bysource
```
# Offline Inference
:::{toctree}
:caption: Contents
:maxdepth: 1
llm
llm_inputs
:::