Commit afd0da21 authored by zhuwenwen

Merge tag 'v0.7.1' into v0.7.1-dev

parents 1a11f127 4f4d427a
# Integrations
```{toctree}
:maxdepth: 1
run_on_sky
deploying_with_kserve
deploying_with_kubeai
deploying_with_triton
deploying_with_bentoml
deploying_with_cerebrium
deploying_with_lws
deploying_with_dstack
serving_with_langchain
serving_with_llamaindex
serving_with_llamastack
```
# External Integrations
:::{toctree}
:maxdepth: 1
langchain
llamaindex
:::
(run-on-langchain)=
(serving-langchain)=
# Serving with Langchain
# LangChain
vLLM is also available via [Langchain](https://github.com/langchain-ai/langchain).
vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain).
To install langchain, run
To install LangChain, run
```console
$ pip install langchain langchain_community -q
pip install langchain langchain_community -q
```
To run inference on one or more GPUs, use the `VLLM` class from `langchain`.
......
(run-on-llamaindex)=
(serving-llamaindex)=
# Serving with llama_index
# LlamaIndex
vLLM is also available via [llama_index](https://github.com/run-llama/llama_index).
vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index).
To install llamaindex, run
To install LlamaIndex, run
```console
$ pip install llama-index-llms-vllm -q
pip install llama-index-llms-vllm -q
```
To run inference on one or more GPUs, use the `Vllm` class from `llamaindex`.
......
......@@ -4,10 +4,10 @@ vLLM exposes a number of metrics that can be used to monitor the health of the
system. These metrics are exposed via the `/metrics` endpoint on the vLLM
OpenAI compatible API server.
You can start the server using Python, or using [Docker](deploying_with_docker.md):
You can start the server using Python, or using [Docker](#deployment-docker):
```console
$ vllm serve unsloth/Llama-3.2-1B-Instruct
vllm serve unsloth/Llama-3.2-1B-Instruct
```
Then query the endpoint to get the latest metrics from the server:
......@@ -31,8 +31,8 @@ vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-I
The following metrics are exposed:
```{literalinclude} ../../../vllm/engine/metrics.py
:::{literalinclude} ../../../vllm/engine/metrics.py
:end-before: end-metrics-definitions
:language: python
:start-after: begin-metrics-definitions
```
:::
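The `/metrics` endpoint returns plain text in the Prometheus exposition format, one sample per line. As a minimal sketch, a sample line can be split into name, labels, and value as shown below. The sample value and the `parse_metric_line` helper are made up for illustration and are not part of vLLM.

```python
import re

# Matches: metric_name{label="value",...} 12.0   (the label block is optional)
_SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)$'
)
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metric_line(line: str):
    """Return (name, labels, value) for a sample line, or None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = _SAMPLE_RE.match(line)
    if match is None:
        return None
    labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


# Illustrative input only; real values come from the running server.
sample = '''
# TYPE vllm:iteration_tokens_total histogram
vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
'''
parsed = [p for p in map(parse_metric_line, sample.splitlines()) if p]
```

For production monitoring, a Prometheus server or an existing client library is usually a better choice than hand-written parsing.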
......@@ -4,21 +4,21 @@
This page teaches you how to pass multi-modal inputs to [multi-modal models](#supported-mm-models) in vLLM.
```{note}
:::{note}
We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
```
:::
## Offline Inference
To input multi-modal data, follow this schema in {class}`vllm.inputs.PromptType`:
- `prompt`: The prompt should follow the format that is documented on HuggingFace.
- `multi_modal_data`: This is a dictionary that follows the schema defined in {class}`vllm.multimodal.MultiModalDataDict`.
- `multi_modal_data`: This is a dictionary that follows the schema defined in {class}`vllm.multimodal.inputs.MultiModalDataDict`.
### Image
You can pass a single image to the {code}`'image'` field of the multi-modal dictionary, as shown in the following examples:
You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples:
```python
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
......@@ -60,7 +60,7 @@ for o in outputs:
print(generated_text)
```
Full example: <gh-file:examples/offline_inference_vision_language.py>
Full example: <gh-file:examples/offline_inference/vision_language.py>
To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
......@@ -91,7 +91,7 @@ for o in outputs:
print(generated_text)
```
Full example: <gh-file:examples/offline_inference_vision_language_multi_image.py>
Full example: <gh-file:examples/offline_inference/vision_language_multi_image.py>
Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
......@@ -122,21 +122,21 @@ for o in outputs:
### Video
You can pass a list of NumPy arrays directly to the {code}`'video'` field of the multi-modal dictionary
You can pass a list of NumPy arrays directly to the `'video'` field of the multi-modal dictionary
instead of using multi-image input.
Full example: <gh-file:examples/offline_inference_vision_language.py>
Full example: <gh-file:examples/offline_inference/vision_language.py>
### Audio
You can pass a tuple {code}`(array, sampling_rate)` to the {code}`'audio'` field of the multi-modal dictionary.
You can pass a tuple `(array, sampling_rate)` to the `'audio'` field of the multi-modal dictionary.
Full example: <gh-file:examples/offline_inference_audio_language.py>
Full example: <gh-file:examples/offline_inference/audio_language.py>
### Embedding
To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
pass a tensor of shape {code}`(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
```python
# Inference with image embeddings as input
......@@ -199,17 +199,17 @@ for o in outputs:
print(generated_text)
```
## Online Inference
## Online Serving
Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).
```{important}
:::{important}
A chat template is **required** to use the Chat Completions API.
Although most models come with a chat template, for others you have to define one yourself.
The chat template can be inferred based on the documentation on the model's HuggingFace repo.
For example, LLaVA-1.5 (`llava-hf/llava-1.5-7b-hf`) requires a chat template that can be found here: <gh-file:examples/template_llava.jinja>
```
:::
### Image
......@@ -271,30 +271,31 @@ chat_response = client.chat.completions.create(
print("Chat completion output:", chat_response.choices[0].message.content)
```
Full example: <gh-file:examples/openai_chat_completion_client_for_multimodal.py>
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
```{tip}
:::{tip}
Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
and pass the file path as `url` in the API request.
```
:::
```{tip}
:::{tip}
There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
```
:::
````{note}
:::{note}
By default, the timeout for fetching images over HTTP is `5` seconds.
You can override this by setting the environment variable:
```console
$ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
```
````
:::
### Video
Instead of {code}`image_url`, you can pass a video file via {code}`video_url`. Here is a simple example using [LLaVA-OneVision](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf).
Instead of `image_url`, you can pass a video file via `video_url`. Here is a simple example using [LLaVA-OneVision](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf).
First, launch the OpenAI-compatible server:
......@@ -303,6 +304,7 @@ vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model
```
Then, you can use the OpenAI client as follows:
```python
from openai import OpenAI
......@@ -342,16 +344,17 @@ result = chat_completion_from_url.choices[0].message.content
print("Chat completion output from image url:", result)
```
Full example: <gh-file:examples/openai_chat_completion_client_for_multimodal.py>
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
````{note}
:::{note}
By default, the timeout for fetching videos over HTTP is `30` seconds.
You can override this by setting the environment variable:
```console
$ export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
```
````
:::
### Audio
......@@ -418,7 +421,7 @@ result = chat_completion_from_base64.choices[0].message.content
print("Chat completion output from input audio:", result)
```
Alternatively, you can pass {code}`audio_url`, which is the audio counterpart of {code}`image_url` for image input:
Alternatively, you can pass `audio_url`, which is the audio counterpart of `image_url` for image input:
```python
chat_completion_from_url = client.chat.completions.create(
......@@ -445,26 +448,27 @@ result = chat_completion_from_url.choices[0].message.content
print("Chat completion output from audio url:", result)
```
Full example: <gh-file:examples/openai_chat_completion_client_for_multimodal.py>
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
````{note}
:::{note}
By default, the timeout for fetching audio over HTTP is `10` seconds.
You can override this by setting the environment variable:
```console
$ export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
```
````
:::
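The image, video, and audio fetch timeouts can also be set from Python when the server is launched as a subprocess. The following is a minimal sketch: the `build_server_env` helper and the override values are illustrative assumptions, and the launch itself is left commented out.

```python
import os

# Defaults documented above, in seconds.
DEFAULT_TIMEOUTS = {
    "VLLM_IMAGE_FETCH_TIMEOUT": 5,
    "VLLM_VIDEO_FETCH_TIMEOUT": 30,
    "VLLM_AUDIO_FETCH_TIMEOUT": 10,
}


def build_server_env(overrides: dict) -> dict:
    """Copy the current environment and apply fetch-timeout overrides."""
    unknown = set(overrides) - set(DEFAULT_TIMEOUTS)
    if unknown:
        raise KeyError(f"Unknown timeout variable(s): {sorted(unknown)}")
    env = dict(os.environ)
    # Environment variable values must be strings.
    env.update({name: str(seconds) for name, seconds in overrides.items()})
    return env


env = build_server_env({
    "VLLM_IMAGE_FETCH_TIMEOUT": 15,
    "VLLM_VIDEO_FETCH_TIMEOUT": 60,
})

# To launch the server with these settings (not executed here):
# import subprocess
# subprocess.run(["vllm", "serve", "<model>"], env=env, check=True)
```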
### Embedding
vLLM's Embeddings API is a superset of OpenAI's [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings),
where a list of chat `messages` can be passed instead of batched `inputs`. This enables multi-modal inputs to be passed to embedding models.
```{tip}
:::{tip}
The schema of `messages` is exactly the same as in Chat Completions API.
You can refer to the above tutorials for more details on how to pass each type of multi-modal data.
```
:::
Usually, embedding models do not expect chat-based input, so we need to use a custom chat template to format the text and images.
Refer to the examples below for illustration.
......@@ -476,13 +480,13 @@ vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
--trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
```
```{important}
:::{important}
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
to run this model in embedding mode instead of text generation mode.
The custom chat template is completely different from the original one for this model,
and can be found here: <gh-file:examples/template_vlm2vec.jinja>
```
:::
Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
......@@ -517,16 +521,16 @@ vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
--trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
```
```{important}
:::{important}
Like with VLM2Vec, we have to explicitly pass `--task embed`.
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
```
:::
```{important}
:::{important}
`MrLight/dse-qwen2-2b-mrl-v1` also requires a placeholder image of the minimum image size for text query embeddings. See the full code
example below for details.
```
:::
Full example: <gh-file:examples/openai_chat_embedding_client_for_multimodal.py>
Full example: <gh-file:examples/online_serving/openai_chat_embedding_client_for_multimodal.py>
(offline-inference)=
# Offline Inference
You can run vLLM in your own code on a list of prompts.
The offline API is based on the {class}`~vllm.LLM` class.
To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.
For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
and runs it in vLLM using the default configuration.
```python
llm = LLM(model="facebook/opt-125m")
```
After initializing the `LLM` instance, you can perform model inference using various APIs.
The available APIs depend on the type of model that is being run:
- [Generative models](#generative-models) output logprobs, from which the final output text is sampled.
- [Pooling models](#pooling-models) output their hidden states directly.
Please refer to the above pages for more details about each API.
:::{seealso}
[API Reference](/api/offline_inference/index)
:::
## Configuration Options
This section lists the most common options for running the vLLM engine.
For a full list, refer to the [Engine Arguments](#engine-args) page.
(model-resolution)=
### Model resolution
vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model repository
and finding the corresponding implementation that is registered to vLLM.
Nevertheless, our model resolution may fail for the following reasons:
- The `config.json` of the model repository lacks the `architectures` field.
- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.
To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
For example:
```python
model = LLM(
model="cerebras/Cerebras-GPT-1.3B",
hf_overrides={"architectures": ["GPT2LMHeadModel"]}, # GPT-2
)
```
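The resolution logic described above can be pictured roughly as follows. This is a simplified sketch, not vLLM's actual implementation: `SUPPORTED` stands in for vLLM's model registry, and `resolve_architecture` is a hypothetical helper.

```python
from typing import Optional

# Stand-in for the architectures registered with vLLM (illustrative subset).
SUPPORTED = {"GPT2LMHeadModel", "LlamaForCausalLM"}


def resolve_architecture(config: dict, hf_overrides: Optional[dict] = None) -> str:
    """Return the first supported architecture listed in config.json after overrides."""
    merged = {**config, **(hf_overrides or {})}  # overrides win over config.json
    architectures = merged.get("architectures") or []
    for name in architectures:
        if name in SUPPORTED:
            return name
    raise ValueError(f"Cannot resolve a supported architecture from {architectures!r}")


# A config.json without an `architectures` field cannot be resolved on its own,
# but resolves once the architecture is supplied explicitly.
broken_config = {"model_type": "gpt2"}
arch = resolve_architecture(broken_config, {"architectures": ["GPT2LMHeadModel"]})
```

The same override also covers the alternative-name case, since the overridden `architectures` value replaces whatever the repository declared.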
Our [list of supported models](#supported-models) shows the model architectures that are recognized by vLLM.
### Reducing memory usage
Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.
#### Tensor Parallelism (TP)
Tensor parallelism (`tensor_parallel_size` option) can be used to split the model across multiple GPUs.
The following code splits the model across 2 GPUs.
```python
llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
tensor_parallel_size=2)
```
:::{important}
To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. {func}`torch.cuda.set_device`)
before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.
To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
:::
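For example, to restrict vLLM to the first two GPUs, set the variable at the very top of the script, before anything initializes CUDA. This is a minimal sketch: the device indices are an assumption about your machine, and the `LLM` call is commented out so that the snippet runs without a GPU.

```python
import os

# Set this before importing or calling anything that initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# from vllm import LLM
# llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", tensor_parallel_size=2)
```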
#### Quantization
Quantized models take less memory at the cost of lower precision.
Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Neural Magic](https://huggingface.co/neuralmagic))
and used directly without extra configuration.
Dynamic quantization is also supported via the `quantization` option -- see [here](#quantization-index) for more details.
#### Context length and batch size
You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
and the maximum batch size (`max_num_seqs` option).
```python
llm = LLM(model="adept/fuyu-8b",
max_model_len=2048,
max_num_seqs=2)
```
### Performance optimization and tuning
You can potentially improve the performance of vLLM by tuning various options.
Please refer to [this guide](#optimization-and-tuning) for more details.
# OpenAI Compatible Server
(openai-compatible-server)=
vLLM provides an HTTP server that implements OpenAI's [Completions](https://platform.openai.com/docs/api-reference/completions) and [Chat](https://platform.openai.com/docs/api-reference/chat) API, and more!
# OpenAI-Compatible Server
vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more!
You can start the server via the [`vllm serve`](#vllm-serve) command, or through [Docker](#deployment-docker):
You can start the server via the [`vllm serve`](#vllm-serve) command, or through [Docker](deploying_with_docker.md):
```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
```
To call the server, you can use the [official OpenAI Python client](https://github.com/openai/openai-python), or any other HTTP client.
```python
from openai import OpenAI
client = OpenAI(
......@@ -46,8 +50,14 @@ In addition, we have the following custom APIs:
- Applicable to all [pooling models](../models/pooling_models.md).
- [Score API](#score-api) (`/score`)
- Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
- [Re-rank API](#rerank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
- Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
- Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
- Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
- Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
(chat-template)=
## Chat Template
In order for the language model to support chat protocol, vLLM requires the model to include
......@@ -69,6 +79,7 @@ vLLM community provides a set of chat templates for popular models. You can find
With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies
both a `type` and a `text` field. An example is provided below:
```python
completion = client.chat.completions.create(
model="NousResearch/Meta-Llama-3-8B-Instruct",
......@@ -78,7 +89,7 @@ completion = client.chat.completions.create(
)
```
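For text-only messages, the plain-string format and the list-of-parts format carry the same information. The sketch below shows one way to convert between them. The helper names are hypothetical, joining parts with a newline is an assumption, and this is not vLLM's internal conversion code.

```python
from typing import Union

Content = Union[str, list]


def to_openai_parts(content: Content) -> list:
    """Wrap plain string content in the list-of-parts format."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return content


def to_string(content: Content) -> str:
    """Flatten list-of-parts content to a string, keeping only text parts."""
    if isinstance(content, str):
        return content
    # Assumption: text parts are joined with a newline.
    return "\n".join(part["text"] for part in content if part.get("type") == "text")


message = {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
as_parts = to_openai_parts(message["content"])
round_trip = to_string(as_parts)
```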
Most chat templates for LLMs expect the `content` field to be a string, but there are some newer models like
Most chat templates for LLMs expect the `content` field to be a string, but there are some newer models like
`meta-llama/Llama-Guard-3-1B` that expect the content to be formatted according to the OpenAI schema in the
request. vLLM provides best-effort support to detect this automatically, which is logged as a string like
*"Detected the chat template content format to be..."*, and internally converts incoming requests to match
......@@ -113,12 +124,12 @@ completion = client.chat.completions.create(
## Extra HTTP Headers
Only the `X-Request-Id` HTTP request header is supported for now. It can be enabled
with `--enable-request-id-headers`.
with `--enable-request-id-headers`.
> Note that enablement of the headers can impact performance significantly at high QPS
> rates. We recommend implementing HTTP headers at the router level (e.g. via Istio),
> rather than within the vLLM layer for this reason.
> See https://github.com/vllm-project/vllm/pull/11529 for more details.
> See [this PR](https://github.com/vllm-project/vllm/pull/11529) for more details.
```python
completion = client.chat.completions.create(
......@@ -145,15 +156,16 @@ print(completion._request_id)
## CLI Reference
(vllm-serve)=
### `vllm serve`
The `vllm serve` command is used to launch the OpenAI-compatible server.
```{argparse}
:::{argparse}
:module: vllm.entrypoints.openai.cli_args
:func: create_parser_for_docs
:prog: vllm serve
```
:::
#### Configuration file
......@@ -173,43 +185,45 @@ uvicorn-log-level: "info"
To use the above config file:
```bash
$ vllm serve SOME_MODEL --config config.yaml
vllm serve SOME_MODEL --config config.yaml
```
```{note}
:::{note}
If an argument is supplied both on the command line and in the config file, the value from the command line takes precedence.
The order of priorities is `command line > config file values > defaults`.
```
:::
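The priority order can be illustrated with a plain dictionary merge, where later sources override earlier ones. This is only a sketch of the rule, not how `vllm serve` parses its arguments, and the option values are hypothetical.

```python
def resolve_options(defaults: dict, config_file: dict, command_line: dict) -> dict:
    """Merge option sources; later sources override earlier ones."""
    return {**defaults, **config_file, **command_line}


# Hypothetical values for illustration.
defaults = {"host": "0.0.0.0", "port": 8000, "uvicorn-log-level": "info"}
config_file = {"port": 6379, "uvicorn-log-level": "debug"}
command_line = {"port": 9000}

options = resolve_options(defaults, config_file, command_line)
```

Here `port` comes from the command line, `uvicorn-log-level` from the config file, and `host` from the defaults.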
## API Reference
(completions-api)=
### Completions API
Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
Code example: <gh-file:examples/openai_completion_client.py>
Code example: <gh-file:examples/online_serving/openai_completion_client.py>
#### Extra parameters
The following [sampling parameters (click through to see documentation)](../dev/sampling_params.md) are supported.
The following [sampling parameters](#sampling-params) are supported.
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-completion-sampling-params
:end-before: end-completion-sampling-params
```
:::
The following extra parameters are supported:
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-completion-extra-params
:end-before: end-completion-extra-params
```
:::
(chat-api)=
### Chat API
Our Chat API is compatible with [OpenAI's Chat Completions API](https://platform.openai.com/docs/api-reference/chat);
......@@ -217,30 +231,31 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
[Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
see our [Multimodal Inputs](../usage/multimodal_inputs.md) guide for more information.
see our [Multimodal Inputs](#multimodal-inputs) guide for more information.
- *Note: `image_url.detail` parameter is not supported.*
Code example: <gh-file:examples/openai_chat_completion_client.py>
Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>
#### Extra parameters
The following [sampling parameters (click through to see documentation)](../dev/sampling_params.md) are supported.
The following [sampling parameters](#sampling-params) are supported.
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-chat-completion-sampling-params
:end-before: end-chat-completion-sampling-params
```
:::
The following extra parameters are supported:
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-chat-completion-extra-params
:end-before: end-chat-completion-extra-params
```
:::
(embeddings-api)=
### Embeddings API
Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
......@@ -249,39 +264,40 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api))
which will be treated as a single prompt to the model.
```{tip}
:::{tip}
This enables multi-modal inputs to be passed to embedding models, see [this page](#multimodal-inputs) for details.
```
:::
Code example: <gh-file:examples/openai_embedding_client.py>
Code example: <gh-file:examples/online_serving/openai_embedding_client.py>
#### Extra parameters
The following [pooling parameters (click through to see documentation)](../dev/pooling_params.md) are supported.
The following [pooling parameters](#pooling-params) are supported.
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-embedding-pooling-params
:end-before: end-embedding-pooling-params
```
:::
The following extra parameters are supported by default:
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-embedding-extra-params
:end-before: end-embedding-extra-params
```
:::
For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead:
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-chat-embedding-extra-params
:end-before: end-chat-embedding-extra-params
```
:::
(tokenizer-api)=
### Tokenizer API
Our Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).
......@@ -291,15 +307,17 @@ It consists of two endpoints:
- `/detokenize` corresponds to calling `tokenizer.decode()`.
(pooling-api)=
### Pooling API
Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states.
The input format is the same as [Embeddings API](#embeddings-api), but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
Code example: <gh-file:examples/openai_pooling_client.py>
Code example: <gh-file:examples/online_serving/openai_pooling_client.py>
(score-api)=
### Score API
Our Score API applies a cross-encoder model to predict scores for sentence pairs.
......@@ -307,7 +325,7 @@ Usually, the score for a sentence pair refers to the similarity between two sent
You can find the documentation for this kind of model at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
Code example: <gh-file:examples/openai_cross_encoder_score.py>
Code example: <gh-file:examples/online_serving/openai_cross_encoder_score.py>
#### Single inference
......@@ -445,18 +463,105 @@ Response:
#### Extra parameters
The following [pooling parameters (click through to see documentation)](../dev/pooling_params.md) are supported.
The following [pooling parameters](#pooling-params) are supported.
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-score-pooling-params
:end-before: end-score-pooling-params
```
:::
The following extra parameters are supported:
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-score-extra-params
:end-before: end-score-extra-params
:::
(rerank-api)=
### Re-rank API
Our Re-rank API applies a cross-encoder model to predict relevance scores between a single query and
each document in a list. Usually, the score for a sentence pair refers to the similarity between two sentences, on
a scale of 0 to 1.
You can find the documentation for this kind of model at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
The rerank endpoints support popular re-rank models such as `BAAI/bge-reranker-base` and other models supporting the
`score` task. Additionally, `/rerank`, `/v1/rerank`, and `/v2/rerank`
endpoints are compatible with both [Jina AI's re-rank API interface](https://jina.ai/reranker/) and
[Cohere's re-rank API interface](https://docs.cohere.com/v2/reference/rerank) to ensure compatibility with
popular open-source tools.
Code example: <gh-file:examples/online_serving/jinaai_rerank_client.py>
#### Example Request
Note that the `top_n` request parameter is optional and will default to the length of the `documents` field.
Result documents will be sorted by relevance, and the `index` property can be used to determine original order.
Request:
```bash
curl -X 'POST' \
'http://127.0.0.1:8000/v1/rerank' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-base",
"query": "What is the capital of France?",
"documents": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris.",
"Horses and cows are both animals"
]
}'
```
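The same request can be built in Python. This minimal sketch uses only the standard library and assumes the server from the `curl` example is listening on `127.0.0.1:8000`; the call that actually sends the request is left commented out.

```python
import json
import urllib.request

payload = {
    "model": "BAAI/bge-reranker-base",
    "query": "What is the capital of France?",
    "documents": [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris.",
        "Horses and cows are both animals",
    ],
}

request = urllib.request.Request(
    "http://127.0.0.1:8000/v1/rerank",
    data=json.dumps(payload).encode("utf-8"),
    headers={"accept": "application/json", "Content-Type": "application/json"},
    method="POST",
)

# Requires a running server:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```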
Response:
```json
{
"id": "rerank-fae51b2b664d4ed38f5969b612edff77",
"model": "BAAI/bge-reranker-base",
"usage": {
"total_tokens": 56
},
"results": [
{
"index": 1,
"document": {
"text": "The capital of France is Paris."
},
"relevance_score": 0.99853515625
},
{
"index": 0,
"document": {
"text": "The capital of Brazil is Brasilia."
},
"relevance_score": 0.0005860328674316406
}
]
}
```
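Because results are sorted by relevance, the `index` field is what links each result back to its position in the request's `documents` list. A short sketch, using an abridged copy of the response above:

```python
import json

response_text = """
{
  "results": [
    {"index": 1,
     "document": {"text": "The capital of France is Paris."},
     "relevance_score": 0.99853515625},
    {"index": 0,
     "document": {"text": "The capital of Brazil is Brasilia."},
     "relevance_score": 0.0005860328674316406}
  ]
}
"""
results = json.loads(response_text)["results"]

# The most relevant document comes first, as returned by the server.
best = results[0]["document"]["text"]

# Map each original document position to its score, then list scores in request order.
score_by_index = {r["index"]: r["relevance_score"] for r in results}
original_order = [score_by_index[i] for i in sorted(score_by_index)]
```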
#### Extra parameters
The following [pooling parameters](#pooling-params) are supported.
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-rerank-pooling-params
:end-before: end-rerank-pooling-params
:::
The following extra parameters are supported:
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-rerank-extra-params
:end-before: end-rerank-extra-params
:::
......@@ -45,7 +45,7 @@ You can preview the collected data by running the following command:
tail ~/.config/vllm/usage_stats.json
```
## Opt-out of Usage Stats Collection
## Opting out
You can opt out of usage stats collection by setting the `VLLM_NO_USAGE_STATS` or `DO_NOT_TRACK` environment variable, or by creating a `~/.config/vllm/do_not_track` file:
......
(compatibility-matrix)=
# Compatibility Matrix
The tables below show which features are mutually exclusive and which features are supported on which hardware.
```{note}
Click a linked '✗' to see the tracking issue for that unsupported feature/hardware combination.
```
## Feature x Feature
```{raw} html
<style>
/* Make smaller to try to improve readability */
td {
font-size: 0.8rem;
text-align: center;
}
th {
text-align: center;
font-size: 0.8rem;
}
</style>
```
```{list-table}
:header-rows: 1
:stub-columns: 1
:widths: auto
* - Feature
- [CP](#chunked-prefill)
- [APC](#apc)
- [LoRA](#lora-adapter)
- <abbr title="Prompt Adapter">prmpt adptr</abbr>
- [SD](#spec_decode)
- CUDA graph
- <abbr title="Pooling Models">pooling</abbr>
- <abbr title="Encoder-Decoder Models">enc-dec</abbr>
- <abbr title="Logprobs">logP</abbr>
- <abbr title="Prompt Logprobs">prmpt logP</abbr>
- <abbr title="Async Output Processing">async output</abbr>
- multi-step
- <abbr title="Multimodal Inputs">mm</abbr>
- best-of
- beam-search
- <abbr title="Guided Decoding">guided dec</abbr>
* - [CP](#chunked-prefill)
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
* - [APC](#apc)
- ✅
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
* - [LoRA](#lora-adapter)
- [✗](gh-pr:9057)
- ✅
-
-
-
-
-
-
-
-
-
-
-
-
-
-
* - <abbr title="Prompt Adapter">prmpt adptr</abbr>
- ✅
- ✅
- ✅
-
-
-
-
-
-
-
-
-
-
-
-
-
* - [SD](#spec_decode)
- ✅
- ✅
- ✗
- ✅
-
-
-
-
-
-
-
-
-
-
-
-
* - CUDA graph
- ✅
- ✅
- ✅
- ✅
- ✅
-
-
-
-
-
-
-
-
-
-
-
* - <abbr title="Pooling Models">pooling</abbr>
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
-
-
-
-
-
-
-
-
-
-
* - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
- ✗
- [✗](gh-issue:7366)
- ✗
- ✗
- [✗](gh-issue:7366)
- ✅
- ✅
-
-
-
-
-
-
-
-
-
* - <abbr title="Logprobs">logP</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✗
- ✅
-
-
-
-
-
-
-
-
* - <abbr title="Prompt Logprobs">prmpt logP</abbr>
- ✅
- ✅
- ✅
- ✅
- [✗](gh-pr:8199)
- ✅
- ✗
- ✅
- ✅
-
-
-
-
-
-
-
* - <abbr title="Async Output Processing">async output</abbr>
- ✅
- ✅
- ✅
- ✅
- ✗
- ✅
- ✗
- ✗
- ✅
- ✅
-
-
-
-
-
-
* - multi-step
- ✗
- ✅
- ✗
- ✅
- ✗
- ✅
- ✗
- ✗
- ✅
- [✗](gh-issue:8198)
- ✅
-
-
-
-
-
* - <abbr title="Multimodal Inputs">mm</abbr>
- ✅
- [✗](gh-pr:8348)
- [✗](gh-pr:7199)
- ?
- ?
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ?
-
-
-
-
* - best-of
- ✅
- ✅
- ✅
- ✅
- [✗](gh-issue:6137)
- ✅
- ✗
- ✅
- ✅
- ✅
- ?
- [✗](gh-issue:7968)
- ✅
-
-
-
* - beam-search
- ✅
- ✅
- ✅
- ✅
- [✗](gh-issue:6137)
- ✅
- ✗
- ✅
- ✅
- ✅
- ?
- [✗](gh-issue:7968)
- ?
- ✅
-
-
* - <abbr title="Guided Decoding">guided dec</abbr>
- ✅
- ✅
- ?
- ?
- ✅
- ✅
- ✗
- ?
- ✅
- ✅
- ✅
- [✗](gh-issue:9893)
- ?
- ✅
- ✅
-
```
## Feature x Hardware
```{list-table}
:header-rows: 1
:stub-columns: 1
:widths: auto
* - Feature
- Volta
- Turing
- Ampere
- Ada
- Hopper
- CPU
- AMD
* - [CP](#chunked-prefill)
- [✗](gh-issue:2729)
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - [APC](#apc)
- [✗](gh-issue:3687)
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - [LoRA](#lora-adapter)
- ✅
- ✅
- ✅
- ✅
- ✅
- [✗](gh-pr:4830)
- ✅
* - <abbr title="Prompt Adapter">prmpt adptr</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- [✗](gh-issue:8475)
- ✅
* - [SD](#spec_decode)
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - CUDA graph
- ✅
- ✅
- ✅
- ✅
- ✅
- ✗
- ✅
* - <abbr title="Pooling Models">pooling</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ?
* - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✗
* - <abbr title="Multimodal Inputs">mm</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - <abbr title="Logprobs">logP</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - <abbr title="Prompt Logprobs">prmpt logP</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - <abbr title="Async Output Processing">async output</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✗
- ✗
* - multi-step
- ✅
- ✅
- ✅
- ✅
- ✅
- [✗](gh-issue:8477)
- ✅
* - best-of
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - beam-search
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - <abbr title="Guided Decoding">guided dec</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
```
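The ✗ and ? cells of the table above can be encoded as a small lookup for a pre-flight check before enabling a feature on a given device. The sketch below is illustrative only: `UNSUPPORTED`, `UNKNOWN`, and `check_feature` are hypothetical names that are not part of vLLM, and the entries simply transcribe this table as of this revision.

```python
# Hypothetical helper: transcribes the ✗ / ? cells of the Feature x Hardware
# table above. Not part of vLLM; update it whenever the table changes.
UNSUPPORTED = {
    ("CP", "Volta"),
    ("APC", "Volta"),
    ("LoRA", "CPU"),
    ("prmpt adptr", "CPU"),
    ("CUDA graph", "CPU"),
    ("enc-dec", "AMD"),
    ("async output", "CPU"),
    ("async output", "AMD"),
    ("multi-step", "CPU"),
}
UNKNOWN = {("pooling", "AMD")}


def check_feature(feature: str, hardware: str) -> str:
    """Return 'unsupported', 'unknown', or 'supported' for a table cell.

    Names are not validated: any pair that is not listed falls through
    to 'supported', so spell them exactly as in the table.
    """
    key = (feature, hardware)
    if key in UNSUPPORTED:
        return "unsupported"
    if key in UNKNOWN:
        return "unknown"
    return "supported"


if __name__ == "__main__":
    for feature, hardware in [("CP", "Volta"), ("LoRA", "Ampere"),
                              ("pooling", "AMD")]:
        print(f"{feature} on {hardware}: {check_feature(feature, hardware)}")
```

Storing only the exceptions keeps the lookup short, since most cells in the table are ✅.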
### Quantizer Utilities
`quantize.py`: NVIDIA Quantization utilities using TensorRT-Model-Optimizer, ported
from TensorRT-LLM: [`examples/quantization/quantize.py`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/quantization/quantize.py)
### Prerequisite
#### AMMO (AlgorithMic Model Optimization) Installation: nvidia-ammo 0.7.1 or later
`pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo`
#### AMMO Download (code and docs)
`https://developer.nvidia.com/downloads/assets/cuda/files/nvidia-ammo/nvidia_ammo-0.5.0.tar.gz`
`https://developer.nvidia.com/downloads/assets/cuda/files/nvidia-ammo/nvidia_ammo-0.7.1.tar.gz`
### Usage
#### For FP8, run on an H100 system for speed; the number of GPUs depends on the model size
#### Example: quantize Llama2-7b model from HF to FP8 with FP8 KV Cache:
`python quantize.py --model-dir ./ll2-7b --dtype float16 --qformat fp8 --kv-cache-dtype fp8 --output-dir ./ll2_7b_fp8 --calib-size 512 --tp-size 1`
Outputs: the model structure, the quantized model, and its parameters (with scaling factors) are written as JSON and Safetensors files (the npz file is generated for reference only)
```
# ll ./ll2_7b_fp8/
total 19998244
drwxr-xr-x 2 root root 4096 Feb 7 01:08 ./
drwxrwxr-x 8 1060 1061 4096 Feb 7 01:08 ../
-rw-r--r-- 1 root root 176411 Feb 7 01:08 llama_tp1.json
-rw-r--r-- 1 root root 13477087480 Feb 7 01:09 llama_tp1_rank0.npz
-rw-r--r-- 1 root root 7000893272 Feb 7 01:08 rank0.safetensors
#
```
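To sanity-check an export like the one listed above, a short script can list the output files and confirm that each JSON file parses. This is a minimal sketch using only the Python standard library; `summarize_export` is a hypothetical helper, and it assumes nothing about the JSON schema beyond the file being valid JSON.

```python
import json
from pathlib import Path


def summarize_export(export_dir: str) -> dict:
    """Map each file in export_dir to its size in bytes.

    Raises FileNotFoundError if the directory is missing and
    json.JSONDecodeError if any *.json file fails to parse.
    """
    root = Path(export_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"{export_dir} is not a directory")
    sizes = {}
    for path in sorted(root.iterdir()):
        if not path.is_file():
            continue
        if path.suffix == ".json":
            # Parse only to confirm the file is well-formed JSON.
            with path.open() as f:
                json.load(f)
        sizes[path.name] = path.stat().st_size
    return sizes


if __name__ == "__main__":
    import sys

    # Usage (path is an example): python summarize_export.py ./ll2_7b_fp8
    if len(sys.argv) > 1:
        for name, size in summarize_export(sys.argv[1]).items():
            print(f"{size:>14,d}  {name}")
```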
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # noqa: E501
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Adapted from examples/quantization/hf_ptq.py
"""
import argparse
import copy
import json
import random
import time
import ammo.torch.quantization as atq
import numpy as np
import torch
from ammo.torch.export import export_model_config
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer
RAND_SEED = 1234
MAX_SEQ_LEN = 2048

EMPTY_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {
            "enable": False,
        },
        "*input_quantizer": {
            "enable": False
        },
        "*lm_head*": {
            "enable": False
        },
        "*output_layer*": {
            "enable": False
        },
        "default": {
            "enable": False
        },
    },
    "algorithm": "max",
}

KV_CACHE_CFG = {
    "*.query_key_value.output_quantizer": {
        "num_bits": 8,
        "axis": None,
        "enable": True
    },
    "*.Wqkv.output_quantizer": {
        "num_bits": 8,
        "axis": None,
        "enable": True
    },
    "*.W_pack.output_quantizer": {
        "num_bits": 8,
        "axis": None,
        "enable": True
    },
    "*.c_attn.output_quantizer": {
        "num_bits": 8,
        "axis": None,
        "enable": True
    },
    "*.k_proj.output_quantizer": {
        "num_bits": 8,
        "axis": None,
        "enable": True
    },
    "*.v_proj.output_quantizer": {
        "num_bits": 8,
        "axis": None,
        "enable": True
    },
}

QUANT_CFG_CHOICES = {
    "int8_sq": atq.INT8_SMOOTHQUANT_CFG,
    "fp8": atq.FP8_DEFAULT_CFG,
    "int4_awq": atq.INT4_AWQ_CFG,
    "w4a8_awq": atq.W4A8_AWQ_BETA_CFG,
    "int8_wo": EMPTY_CFG,
    "int4_wo": EMPTY_CFG,
    "full_prec": EMPTY_CFG,
}

MODEL_NAME_PATTERN_MAP = {
    "GPT2": "gpt2",
    "Xverse": "llama",
    "Llama": "llama",
    "Mistral": "llama",
    "GPTJ": "gptj",
    "FalconForCausalLM": "falcon",
    "RWForCausalLM": "falcon",
    "baichuan": "baichuan",
    "MPT": "mpt",
    "Bloom": "bloom",
    "ChatGLM": "chatglm",
    "QWen": "qwen",
}


def get_tokenizer(ckpt_path, max_seq_len=MAX_SEQ_LEN, model_type=None):
    print(f"Initializing tokenizer from {ckpt_path}")
    tokenizer = AutoTokenizer.from_pretrained(
        ckpt_path,
        model_max_length=max_seq_len,
        padding_side="left",
        trust_remote_code=True,
    )
    if model_type and model_type == "qwen":
        # qwen use token id 151643 as pad and eos tokens
        tokenizer.pad_token = tokenizer.convert_ids_to_tokens(151643)
        tokenizer.eos_token = tokenizer.convert_ids_to_tokens(151643)

    # can't set attribute 'pad_token' for "<unk>"
    if tokenizer.pad_token != "<unk>":
        tokenizer.pad_token = tokenizer.eos_token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    assert (tokenizer.pad_token
            is not None), f"Pad token for {model_type} cannot be set!"

    return tokenizer


def get_model(ckpt_path, dtype="fp16", device="cuda"):
    print(f"Initializing model from {ckpt_path}")
    if dtype == "bf16" or dtype == "bfloat16":
        dtype = torch.bfloat16
    elif dtype == "fp16" or dtype == "float16":
        dtype = torch.float16
    elif dtype == "fp32" or dtype == "float32":
        dtype = torch.float32
    else:
        raise NotImplementedError(f"Unknown dtype {dtype}")

    # model_kwargs = {"torch_dtype": dtype}
    model_kwargs = {"torch_dtype": "auto"}

    model = AutoModelForCausalLM.from_pretrained(ckpt_path,
                                                 device_map="auto",
                                                 **model_kwargs,
                                                 trust_remote_code=True)
    model.eval()

    model_dtype = next(model.parameters()).dtype
    if dtype != model_dtype:
        print("[TensorRT-LLM][WARNING] The manually set model data type is "
              f"{dtype}, but the data type of the HuggingFace model is "
              f"{model_dtype}.")

    return model


def get_model_type(model):
    for k, v in MODEL_NAME_PATTERN_MAP.items():
        if k.lower() in type(model).__name__.lower():
            return v
    return None


def get_calib_dataloader(data="cnn_dailymail",
                         tokenizer=None,
                         batch_size=1,
                         calib_size=512,
                         block_size=512,
                         device=None):
    print("Loading calibration dataset")
    if data == "pileval":
        dataset = load_dataset(
            "json",
            data_files="https://the-eye.eu/public/AI/pile/val.jsonl.zst",
            split="train")
        dataset = dataset["text"][:calib_size]
    elif data == "cnn_dailymail":
        dataset = load_dataset("cnn_dailymail", name="3.0.0", split="train")
        dataset = dataset["article"][:calib_size]
    else:
        raise NotImplementedError

    batch_encoded = tokenizer.batch_encode_plus(dataset,
                                                return_tensors="pt",
                                                padding="max_length",
                                                truncation=True,
                                                max_length=block_size)
    if device:
        batch_encoded = batch_encoded.to(device)
    batch_encoded = batch_encoded["input_ids"]

    calib_dataloader = DataLoader(batch_encoded,
                                  batch_size=batch_size,
                                  shuffle=False)

    return calib_dataloader


def quantize_model(model, quant_cfg, calib_dataloader=None):

    def calibrate_loop():
        """Adjusts weights and scaling factors based on selected algorithms."""
        if calib_dataloader is None:
            return
        for idx, data in enumerate(calib_dataloader):
            print(f"Calibrating batch {idx}")
            model(data)

    print("Starting quantization...")
    start_time = time.time()
    atq.quantize(model, quant_cfg, forward_loop=calibrate_loop)
    end_time = time.time()
    print("Quantization done. Total time used: {:.2f} s.".format(end_time -
                                                                  start_time))

    return model


def main(args):
    if not torch.cuda.is_available():
        raise OSError("GPU is required for inference.")

    random.seed(RAND_SEED)
    np.random.seed(RAND_SEED)

    model = get_model(args.model_dir, args.dtype, args.device)
    model_type = get_model_type(model)
    tokenizer = get_tokenizer(args.model_dir, model_type=model_type)

    if args.qformat in ["full_prec", "int8_wo", "int4_wo"
                        ] and args.kv_cache_dtype is None:
        print(f"No quantization applied, export {args.dtype} model")
    else:
        if "awq" in args.qformat:
            if args.calib_size > 32:
                print("AWQ calibration could take longer with calib_size = "
                      f"{args.calib_size}, Using calib_size=32 instead")
                args.calib_size = 32
            print("\nAWQ calibration could take longer than other calibration "
                  "methods. Please increase the batch size to speed up the "
                  "calibration process. Batch size can be set by adding the "
                  "argument --batch-size <batch_size> to the command line.\n")

        calib_dataloader = get_calib_dataloader(
            tokenizer=tokenizer,
            batch_size=args.batch_size,
            calib_size=args.calib_size,
            device=args.device,
        )

        if args.qformat in QUANT_CFG_CHOICES:
            quant_cfg = QUANT_CFG_CHOICES[args.qformat]
        else:
            raise ValueError(
                f"Unsupported quantization format: {args.qformat}")

        if "awq" in args.qformat:
            quant_cfg = copy.deepcopy(QUANT_CFG_CHOICES[args.qformat])
            weight_quantizer = quant_cfg["quant_cfg"][
                "*weight_quantizer"]  # type: ignore
            if isinstance(weight_quantizer, list):
                weight_quantizer = weight_quantizer[0]
            weight_quantizer["block_sizes"][-1] = args.awq_block_size

        if args.kv_cache_dtype is not None:
            if args.kv_cache_dtype == "fp8":
                for value in KV_CACHE_CFG.values():
                    value.update({"num_bits": (4, 3)})  # type: ignore
            quant_cfg["quant_cfg"].update(KV_CACHE_CFG)  # type: ignore

        print(quant_cfg)

        model = quantize_model(model, quant_cfg, calib_dataloader)

    with torch.inference_mode():
        if model_type is None:
            print(f"Unknown model type {type(model).__name__}. Continue "
                  "exporting...")
            model_type = f"unknown:{type(model).__name__}"

        export_path = args.output_dir
        start_time = time.time()

        if args.qformat == "int4_awq" and model_type == "qwen":
            torch.save(model.state_dict(), export_path)
        else:
            export_npz = (model_type not in [
                'gptj', 'falcon', 'chatglm', 'mpt', 'llama', 'baichuan'
            ])

            # export safetensors
            export_model_config(
                model,
                model_type,
                getattr(torch, args.dtype),
                export_dir=export_path,
                inference_tensor_parallel=args.tp_size,
                inference_pipeline_parallel=args.pp_size,
                # export_tensorrt_llm_config=(not export_npz),
                export_tensorrt_llm_config=False,
                export_npz=export_npz)

            # Workaround for wo quantization
            if args.qformat in ["int8_wo", "int4_wo", "full_prec"]:
                with open(f"{export_path}/config.json") as f:
                    tensorrt_llm_config = json.load(f)
                if args.qformat == "int8_wo":
                    tensorrt_llm_config["quantization"]["quant_algo"] = 'W8A16'
                elif args.qformat == "int4_wo":
                    tensorrt_llm_config["quantization"]["quant_algo"] = 'W4A16'
                else:
                    tensorrt_llm_config["quantization"]["quant_algo"] = None
                with open(f"{export_path}/config.json", "w") as f:
                    json.dump(tensorrt_llm_config, f, indent=4)

        end_time = time.time()
        print("Quantized model exported to {} \nTotal time used {:.2f} s.".
              format(export_path, end_time - start_time))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--model-dir",
                        help="Specify where the HuggingFace model is",
                        required=True)
    parser.add_argument("--device", default="cuda")
    parser.add_argument("--dtype", help="Model data type.", default="float16")
    parser.add_argument(
        "--qformat",
        help="Quantization format.",
        default="full_prec",
        choices=[
            "fp8", "int8_sq", "int4_awq", "w4a8_awq", "int8_wo", "int4_wo",
            "full_prec"
        ],
    )
    parser.add_argument("--batch-size",
                        help="Batch size for calibration.",
                        type=int,
                        default=1)
    parser.add_argument("--calib-size",
                        help="Number of samples for calibration.",
                        type=int,
                        default=512)
    parser.add_argument("--output-dir", default="exported_model")
    parser.add_argument("--tp-size", type=int, default=1)
    parser.add_argument("--pp-size", type=int, default=1)
    parser.add_argument("--awq-block-size", type=int, default=128)
    parser.add_argument("--kv-cache-dtype",
                        help="KV Cache dtype.",
                        default=None,
                        choices=["int8", "fp8", None])
    args = parser.parse_args()

    main(args)
......@@ -67,7 +67,37 @@ def run_qwen2_audio(question: str, audio_count: int):
return llm, prompt, stop_token_ids
model_example_map = {"ultravox": run_ultravox, "qwen2_audio": run_qwen2_audio}
def run_minicpmo(question: str, audio_count: int):
    model_name = "openbmb/MiniCPM-o-2_6"
    tokenizer = AutoTokenizer.from_pretrained(model_name,
                                              trust_remote_code=True)
    llm = LLM(model=model_name,
              trust_remote_code=True,
              max_model_len=4096,
              max_num_seqs=5,
              limit_mm_per_prompt={"audio": audio_count})

    stop_tokens = ['<|im_end|>', '<|endoftext|>']
    stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]

    audio_placeholder = "(<audio>./</audio>)" * audio_count
    audio_chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n<|spk_bos|><|spk|><|spk_eos|><|tts_bos|>' }}{% endif %}"  # noqa: E501
    messages = [{
        'role': 'user',
        'content': f'{audio_placeholder}\n{question}'
    }]
    prompt = tokenizer.apply_chat_template(messages,
                                           tokenize=False,
                                           add_generation_prompt=True,
                                           chat_template=audio_chat_template)
    return llm, prompt, stop_token_ids


model_example_map = {
    "ultravox": run_ultravox,
    "qwen2_audio": run_qwen2_audio,
    "minicpmo": run_minicpmo
}
def main(args):
......